Introduction

The goal of this project is to use the tools we have learned in class to predict a categorical variable and a numeric one using classification and regression techniques. Each prediction task is divided into two parts: the first part is interpretation, where the main goal is to understand the data and how the variables relate to each other (using statistical tools); and the second is prediction, where we try to maximize the accuracy of our model by reducing its cost (using more complex and flexible methods, such as machine learning).

First and foremost, the error costs in this data set have been invented, so they do not reflect reality. Moreover, some code (such as the slower models) has been commented out because it takes too long to run; otherwise R could not knit the document to HTML, so the results have been pasted in instead. With that said, let’s start.

Initial Set-Up

workingDirectory = ""
setwd(workingDirectory)

loadLibraries = function(libraries) {
    for(lib in libraries) {
        # Install the package first if it is not already available.
        if (!requireNamespace(lib, quietly = TRUE)) {
            install.packages(lib)
        }
        # character.only = TRUE lets library() take the name as a string,
        # which avoids the eval(parse()) workaround.
        library(lib, character.only = TRUE)
    }
}

# All libraries that we are going to use.
libraries = c("VIM", "rlang", "ggplot2", "caret", "tidyverse", "MASS", "e1071", 
              "GGally", "tm", "SnowballC", "naivebayes", "skimr", "mice", 
              "glmnet", "rpart", "pROC", "class", "randomForest", "dplyr", 
              "factoextra", "foreach", "doParallel", "rpart.plot", "ggord", "fastDummies",
              "ellipse")

loadLibraries(libraries)

rm(list = ls())

Get the data

The data set that we are going to use in this project comes from Kaggle (https://www.kaggle.com/mojtaba142/hotel-booking). The data describes two hotels (a city hotel and a resort hotel) and their customers; the goal of this project is to predict whether a customer will cancel the reservation given their booking information.

hotel.df = read.csv("hotel_booking.csv", stringsAsFactors = TRUE)

First of all, we need to delete the following variables (personal customer data), as they were invented by the data set’s creator and are not relevant to us.

colsToRemove = c("name", "email", "phone.number", "credit_card")
# %in% matches every name in the vector; == would recycle element-wise
# and only match columns that happen to line up with the recycled pattern.
hotel.df = hotel.df[, -which(names(hotel.df) %in% colsToRemove)]

rm(colsToRemove)

Column descriptions

Now, let’s take a look at the data and see what we are working with.

summary(hotel.df)
##           hotel        is_canceled       lead_time   arrival_date_year
##  City Hotel  :79330   Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Resort Hotel:40060   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##                       Median :0.0000   Median : 69   Median :2016     
##                       Mean   :0.3704   Mean   :104   Mean   :2016     
##                       3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                       Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                       
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13877      Min.   : 1.00            Min.   : 1.0             
##  July   :12661      1st Qu.:16.00            1st Qu.: 8.0             
##  May    :11791      Median :28.00            Median :16.0             
##  October:11160      Mean   :27.17            Mean   :15.8             
##  April  :11089      3rd Qu.:38.00            3rd Qu.:23.0             
##  June   :10939      Max.   :53.00            Max.   :31.0             
##  (Other):47873                                                        
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies                 meal          country     
##  Min.   : 0.0000   Min.   : 0.000000   BB       :92310   PRT    :48590  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB       :  798   GBR    :12129  
##  Median : 0.0000   Median : 0.000000   HB       :14463   FRA    :10415  
##  Mean   : 0.1039   Mean   : 0.007949   SC       :10650   ESP    : 8568  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000   Undefined: 1169   DEU    : 7287  
##  Max.   :10.0000   Max.   :10.000000                     ITA    : 3766  
##  NA's   :4                                               (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      Min.   :0.00000  
##  Offline TA/TO:24219   Direct   :14645      1st Qu.:0.00000  
##  Groups       :19811   GDS      :  193      Median :0.00000  
##  Direct       :12606   TA/TO    :97870      Mean   :0.03191  
##  Corporate    : 5295   Undefined:    5      3rd Qu.:0.00000  
##  Complementary:  743                        Max.   :1.00000  
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1118     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes       deposit_type        agent       
##  A      :74053      Min.   : 0.0000   No Deposit:104641   Min.   :  1.00  
##  D      :25322      1st Qu.: 0.0000   Non Refund: 14587   1st Qu.:  9.00  
##  E      : 7806      Median : 0.0000   Refundable:   162   Median : 14.00  
##  F      : 3751      Mean   : 0.2211                       Mean   : 86.69  
##  G      : 2553      3rd Qu.: 0.0000                       3rd Qu.:229.00  
##  C      : 2375      Max.   :21.0000                       Max.   :535.00  
##  (Other): 3530                                            NA's   :16340   
##     company       days_in_waiting_list         customer_type  
##  Min.   :  6.0    Min.   :  0.000      Contract       : 4076  
##  1st Qu.: 62.0    1st Qu.:  0.000      Group          :  577  
##  Median :179.0    Median :  0.000      Transient      :89613  
##  Mean   :189.3    Mean   :  2.321      Transient-Party:25124  
##  3rd Qu.:270.0    3rd Qu.:  0.000                             
##  Max.   :543.0    Max.   :391.000                             
##  NA's   :112593                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date
##  Canceled :43017    2015-10-21:  1461      
##  Check-Out:75166    2015-07-06:   805      
##  No-Show  : 1207    2016-11-25:   790      
##                     2015-01-01:   763      
##                     2016-01-18:   625      
##                     2015-07-02:   469      
##                     (Other)   :114477

Missing values

aggr(hotel.df, numbers = TRUE, sortVars = TRUE, labels = names(hotel.df), cex.axis = .5, gap = 1)

## 
##  Variables sorted by number of missings: 
##                        Variable        Count
##                         company 9.430689e-01
##                           agent 1.368624e-01
##                        children 3.350364e-05
##                           hotel 0.000000e+00
##                     is_canceled 0.000000e+00
##                       lead_time 0.000000e+00
##               arrival_date_year 0.000000e+00
##              arrival_date_month 0.000000e+00
##        arrival_date_week_number 0.000000e+00
##       arrival_date_day_of_month 0.000000e+00
##         stays_in_weekend_nights 0.000000e+00
##            stays_in_week_nights 0.000000e+00
##                          adults 0.000000e+00
##                          babies 0.000000e+00
##                            meal 0.000000e+00
##                         country 0.000000e+00
##                  market_segment 0.000000e+00
##            distribution_channel 0.000000e+00
##               is_repeated_guest 0.000000e+00
##          previous_cancellations 0.000000e+00
##  previous_bookings_not_canceled 0.000000e+00
##              reserved_room_type 0.000000e+00
##              assigned_room_type 0.000000e+00
##                 booking_changes 0.000000e+00
##                    deposit_type 0.000000e+00
##            days_in_waiting_list 0.000000e+00
##                   customer_type 0.000000e+00
##                             adr 0.000000e+00
##     required_car_parking_spaces 0.000000e+00
##       total_of_special_requests 0.000000e+00
##              reservation_status 0.000000e+00
##         reservation_status_date 0.000000e+00

We can see that there are some missing values, so let’s check whether they are concentrated in a few columns (which we could delete) or spread across rows.

length(which(!complete.cases(hotel.df)))
## [1] 119173

Note that almost every row is incomplete, so simply deleting incomplete rows would wipe out nearly the whole data set. Let’s look at how much of each row is actually missing instead.

hist(rowMeans(is.na(hotel.df)), xlab = c("Missing values average by rows"), main = c())

Here we can see that individual rows are not actually that empty (each is missing only a small fraction of its values), so let’s look at the columns.

indexesEmptyCols = which(colMeans(is.na(hotel.df)) != 0)

colsWithNA = sort(colMeans(is.na(hotel.df[, indexesEmptyCols])), 
                  decreasing = TRUE)

barplot(colsWithNA, las=2)

The columns company and agent are mostly empty, so let’s remove them.

# The two most-missing columns are the first two entries of colsWithNA.
indexesColsToRemove = which(names(hotel.df) %in% names(colsWithNA)[1:2])

hotel.df = hotel.df[, -indexesColsToRemove]

Here, we clear our working environment.

rm(list = setdiff(ls(), "hotel.df"))

Let’s take a look at how many missing values there are left.

length(which(!complete.cases(hotel.df)))
## [1] 4

There are only 4, so let’s remove them.

# The 4 remaining NAs are all in the children column; refer to it by
# name rather than by a hard-coded column index.
hotel.df = hotel.df[!is.na(hotel.df$children), ]

Sometimes missing values hide inside factor variables as empty-string levels and are not clearly visible. So let’s check the factor variables for any.

hotel.df = droplevels.data.frame(hotel.df)

colsWithEmptyFactors.df = data.frame(colName = character(), 
                                     index = integer(), 
                                     naNumber = integer(), 
                                     stringsAsFactors = FALSE)

for (i in 1:ncol(hotel.df)) {
    levels = levels(hotel.df[, i])
    if ("" %in% levels) {
        colsWithEmptyFactors.df = rbind(colsWithEmptyFactors.df, 
                                        data.frame(colName = names(hotel.df)[i], 
                                                   index = i, naNumber = 0,
                                                   stringsAsFactors = FALSE))
    }
}
for (i in 1:nrow(colsWithEmptyFactors.df)) {
    index = colsWithEmptyFactors.df$index[i]
    
    level = levels(hotel.df[, index])
    level[which(level == "")] = NA
    levels(hotel.df[, index]) = level
    
    colsWithEmptyFactors.df[i, 3] = length(which(is.na(hotel.df[, index]) == TRUE))
}

print(colsWithEmptyFactors.df)
##   colName index naNumber
## 1 country    14      488
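As an aside, this kind of hidden missing value can be avoided at read time: read.csv accepts an na.strings argument that maps chosen strings to NA before they ever become factor levels. A minimal sketch, reusing the same file name as above:

```r
# Treat both empty strings and the literal "NA" as missing while
# reading, so "" never shows up as a factor level.
hotel.raw = read.csv("hotel_booking.csv",
                     stringsAsFactors = TRUE,
                     na.strings = c("", "NA"))
```

Had we read the data this way, the empty-level scan above would not have been necessary.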

So there were hidden missing values! But they are only 488 out of roughly 119,000 rows, so let’s just remove them.

indexesEmptyRows = which(rowMeans(is.na(hotel.df))!= 0)

hotel.df = hotel.df[-indexesEmptyRows,]

print(length(which(is.na(hotel.df) == TRUE)))
## [1] 0

After cleaning the data set, let’s start with some feature analysis and maybe engineer a new feature that will help our models.

rm(list = setdiff(ls(), "hotel.df"))

Feature Analysis

glimpse(hotel.df)
## Rows: 118,898
## Columns: 30
## $ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote…
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <fct> July, July, July, July, July, July, Jul…
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,…
## $ country                        <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,…
## $ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli…
## $ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T…
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, …
## $ assigned_room_type             <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, …
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No …
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <fct> Transient, Transient, Transient, Transi…
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-…
## $ reservation_status_date        <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201…

At first sight, we see plenty of variables that need to be fixed and tweaked. So let’s start by defining a function that will take care of most of the work.

REMOVE = "remove"
TO_FACTOR = "toFactor"
TO_INTEGER = "toInteger"
TO_LOGIC = "toLogic"
TO_CHARACTER = "toCharacter"

editColumn = function(colNames, df, toDo) {
    index = which(names(df) == colNames)

    if (!is_empty(index)) {
        if (toDo == REMOVE) {
            df = df[, -index]
        } else if (toDo == TO_FACTOR) {
            df[, index] = as.factor(df[, index])
        } else if (toDo == TO_INTEGER) {
            df[, index] = as.integer(df[, index])
        } else if (toDo == TO_LOGIC) {
            df[, index] = as.logical(df[, index])
        } else if (toDo == TO_CHARACTER) {
            df[, index] = as.character(df[, index])
        } else {
            # stop() actually signals the error; errorCondition() alone
            # only builds a condition object without raising it.
            stop("Could not perform that action.")
        }
    }
    
    return(df)
}

In the data set, we can clearly see that the variable we want to predict is not very descriptive, so let’s solve this (we also convert children to integer, since it was read as a double).

hotel.df = editColumn(c("children"), hotel.df, TO_INTEGER)

hotel.df = editColumn(c("is_canceled"), hotel.df, TO_LOGIC)
hotel.df = editColumn(c("is_canceled"), hotel.df, TO_FACTOR)
levels(hotel.df$is_canceled) = c("no", "yes")

hotel.df = droplevels.data.frame(hotel.df)

summary(hotel.df$is_canceled)
##    no   yes 
## 74745 44153

Now, let’s change the arrival_date_month to a number for better date representation.

levels(hotel.df$arrival_date_month)
##  [1] "April"     "August"    "December"  "February"  "January"   "July"     
##  [7] "June"      "March"     "May"       "November"  "October"   "September"
months.df = data.frame(months = c("January", "February", "March", "April", 
                                  "May", "June", "July", "August", "September",
                                  "October", "November", "December"), number = 1:12)

changeMonthToNumber = function(x) {
  index = which(x == months.df$months)
  
  if (!is_empty(index)) {
    return(months.df$number[index])
  }
  
  return(0)
}


hotel.df = editColumn(c("arrival_date_month"), hotel.df, TO_CHARACTER)
dim(hotel.df$arrival_date_month) = length(hotel.df$arrival_date_month)

hotel.df$arrival_date_month = apply(hotel.df$arrival_date_month, 1, changeMonthToNumber)

summary(hotel.df$arrival_date_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   7.000   6.553   9.000  12.000
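As an aside, base R already ships the English month names, in order, in the built-in month.name constant, so the lookup table and helper above could have been replaced by a single match call before the conversion:

```r
# match() returns the position of each name within month.name, which is
# exactly its month number (January = 1, ..., December = 12).
# Unmatched values yield NA rather than the 0 used by our helper.
month_number = match(c("July", "January", "December"), month.name)
```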

And we will do to is_repeated_guest the same thing we did to is_canceled, as they are the same kind of variable.

hotel.df = editColumn(c("is_repeated_guest"), hotel.df, TO_LOGIC)
hotel.df = editColumn(c("is_repeated_guest"), hotel.df, TO_FACTOR)
levels(hotel.df$is_repeated_guest) = c("no", "yes")

summary(hotel.df$is_repeated_guest)
##     no    yes 
## 115092   3806

Looking at reservation_status_date, we see that it is a date stored as a factor, which is not a very usable format. So let’s split it into three columns: one for the day, one for the month, and one for the year.

hotel.df = editColumn(c("reservation_status_date"), hotel.df, TO_CHARACTER)
dim(hotel.df$reservation_status_date) = length(hotel.df$reservation_status_date)

separateDate = function(x) {
  return(as.integer(unlist(strsplit(x, "-"))))
}

zero.vector = rep(0, nrow(hotel.df))

reservationDate.df = data.frame(reservation_status_date_year = zero.vector, 
                                reservation_status_date_month = zero.vector, 
                                reservation_status_date_day = zero.vector)

hotel.df = cbind(hotel.df, reservationDate.df)


hotel.df[, (ncol(hotel.df) - 2):(ncol(hotel.df))] = t(apply(hotel.df$reservation_status_date, 1, separateDate))

hotel.df = hotel.df[, -(ncol(hotel.df) - 3)]
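The same split can also be done with R’s native date handling, which additionally validates the values (an impossible date becomes NA instead of three bogus integers). A minimal sketch of the idea on illustrative "YYYY-MM-DD" strings like the ones in reservation_status_date:

```r
# Parse the strings into Date objects, then pull out each component
# with format(); invalid dates parse to NA.
dates = as.Date(c("2015-07-01", "2016-11-25"))  # illustrative values
year  = as.integer(format(dates, "%Y"))
month = as.integer(format(dates, "%m"))
day   = as.integer(format(dates, "%d"))
```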

And clean the environment.

rm(list = setdiff(ls(), "hotel.df"))

We suspect that reservation_status is strongly correlated with the is_canceled variable we are trying to predict, so let’s check whether that is really the case.

ggplot(data = hotel.df, aes(x = reservation_status, y = is_canceled)) + geom_boxplot()
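Since both variables are factors, a contingency table states the overlap even more directly than a plot; a minimal sketch, assuming hotel.df still contains both columns at this point:

```r
# Cross-tabulate reservation status against the target: every
# "Canceled" and "No-Show" row should line up with is_canceled = yes,
# and every "Check-Out" row with is_canceled = no.
table(hotel.df$reservation_status, hotel.df$is_canceled)
```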

Here we can clearly see that we were right: both variables carry essentially the same information, since the status directly reveals the outcome. Therefore we will delete it, together with the date columns derived from it, as they would leak the target.

hotel.df = hotel.df[, -((ncol(hotel.df) - 3):ncol(hotel.df))]

And finally, let’s make the variable that we want to predict the first column, for the sake of simplicity.

colOrder = 1:ncol(hotel.df)
colOrder[2] = 1
colOrder[1] = 2

hotel.df = hotel.df[, colOrder]
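Swapping the first two indices works here, but reordering by name is more robust if the column positions ever change; a minimal equivalent sketch:

```r
# Put is_canceled first and keep every other column in its original
# order, selecting by name rather than by position.
hotel.df = hotel.df[, c("is_canceled", setdiff(names(hotel.df), "is_canceled"))]
```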

Variable Similarity

Here, we will study how each variable relates to the others and see whether any variables are worth removing or deserve extra attention.

First, let’s create the dummy variables for our data set. Before that, we will remove country, as it has far too many levels to dummy-encode.

hotel.df = hotel.df[, -which(names(hotel.df) == "country")]

Now let’s look at reserved_room_type and assigned_room_type, because they will give us errors later on with the models.

(which(hotel.df$reserved_room_type == "P"))
## [1] 72478 72479
(which(hotel.df$assigned_room_type == "P"))
## [1] 72478 72479

Here we see that there is a room type that only two customers selected, so let’s delete those rows; otherwise that near-empty level would break our models.

hotel.df = hotel.df[-which(hotel.df$reserved_room_type == "P"),]

hotel.df = droplevels.data.frame(hotel.df)

After this, let’s use the fastDummies package to create the dummy variables. However, we keep the is_canceled column as a factor, as it is the one we want to predict.

indexFactors = names(which(sapply(hotel.df, is.factor)))
indexFactors = indexFactors[-1]  # keep is_canceled as a factor

hotel.df = dummy_cols(hotel.df, select_columns = indexFactors, remove_selected_columns = TRUE)

summary(hotel.df)
##  is_canceled   lead_time     arrival_date_year arrival_date_month
##  no :74745   Min.   :  0.0   Min.   :2015      Min.   : 1.000    
##  yes:44151   1st Qu.: 18.0   1st Qu.:2016      1st Qu.: 4.000    
##              Median : 69.0   Median :2016      Median : 7.000    
##              Mean   :104.3   Mean   :2016      Mean   : 6.553    
##              3rd Qu.:161.0   3rd Qu.:2017      3rd Qu.: 9.000    
##              Max.   :737.0   Max.   :2017      Max.   :12.000    
##  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
##  Min.   : 1.00            Min.   : 1.0              Min.   : 0.0000        
##  1st Qu.:16.00            1st Qu.: 8.0              1st Qu.: 0.0000        
##  Median :28.00            Median :16.0              Median : 1.0000        
##  Mean   :27.17            Mean   :15.8              Mean   : 0.9289        
##  3rd Qu.:38.00            3rd Qu.:23.0              3rd Qu.: 2.0000        
##  Max.   :53.00            Max.   :31.0              Max.   :16.0000        
##  stays_in_week_nights     adults          children           babies         
##  Min.   : 0.000       Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000  
##  1st Qu.: 1.000       1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000  
##  Median : 2.000       Median : 2.000   Median : 0.0000   Median : 0.000000  
##  Mean   : 2.502       Mean   : 1.858   Mean   : 0.1042   Mean   : 0.007948  
##  3rd Qu.: 3.000       3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000  
##  Max.   :41.000       Max.   :55.000   Max.   :10.0000   Max.   :10.000000  
##  previous_cancellations previous_bookings_not_canceled booking_changes  
##  Min.   : 0.00000       Min.   : 0.0000                Min.   : 0.0000  
##  1st Qu.: 0.00000       1st Qu.: 0.0000                1st Qu.: 0.0000  
##  Median : 0.00000       Median : 0.0000                Median : 0.0000  
##  Mean   : 0.08714       Mean   : 0.1316                Mean   : 0.2212  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                3rd Qu.: 0.0000  
##  Max.   :26.00000       Max.   :72.0000                Max.   :21.0000  
##  days_in_waiting_list      adr          required_car_parking_spaces
##  Min.   :  0.000      Min.   :  -6.38   Min.   :0.00000            
##  1st Qu.:  0.000      1st Qu.:  70.00   1st Qu.:0.00000            
##  Median :  0.000      Median :  95.00   Median :0.00000            
##  Mean   :  2.331      Mean   : 102.00   Mean   :0.06189            
##  3rd Qu.:  0.000      3rd Qu.: 126.00   3rd Qu.:0.00000            
##  Max.   :391.000      Max.   :5400.00   Max.   :8.00000            
##  total_of_special_requests hotel_City Hotel hotel_Resort Hotel    meal_BB      
##  Min.   :0.0000            Min.   :0.000    Min.   :0.000      Min.   :0.0000  
##  1st Qu.:0.0000            1st Qu.:0.000    1st Qu.:0.000      1st Qu.:1.0000  
##  Median :0.0000            Median :1.000    Median :0.000      Median :1.0000  
##  Mean   :0.5717            Mean   :0.667    Mean   :0.333      Mean   :0.7726  
##  3rd Qu.:1.0000            3rd Qu.:1.000    3rd Qu.:1.000      3rd Qu.:1.0000  
##  Max.   :5.0000            Max.   :1.000    Max.   :1.000      Max.   :1.0000  
##     meal_FB            meal_HB          meal_SC        meal_Undefined    
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.006712   Mean   :0.1214   Mean   :0.08946   Mean   :0.009798  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000  
##  market_segment_Aviation market_segment_Complementary market_segment_Corporate
##  Min.   :0.000000        Min.   :0.000000             Min.   :0.00000         
##  1st Qu.:0.000000        1st Qu.:0.000000             1st Qu.:0.00000         
##  Median :0.000000        Median :0.000000             Median :0.00000         
##  Mean   :0.001993        Mean   :0.006174             Mean   :0.04299         
##  3rd Qu.:0.000000        3rd Qu.:0.000000             3rd Qu.:0.00000         
##  Max.   :1.000000        Max.   :1.000000             Max.   :1.00000         
##  market_segment_Direct market_segment_Groups market_segment_Offline TA/TO
##  Min.   :0.0000        Min.   :0.0000        Min.   :0.0000              
##  1st Qu.:0.0000        1st Qu.:0.0000        1st Qu.:0.0000              
##  Median :0.0000        Median :0.0000        Median :0.0000              
##  Mean   :0.1047        Mean   :0.1666        Mean   :0.2032              
##  3rd Qu.:0.0000        3rd Qu.:0.0000        3rd Qu.:0.0000              
##  Max.   :1.0000        Max.   :1.0000        Max.   :1.0000              
##  market_segment_Online TA distribution_channel_Corporate
##  Min.   :0.0000           Min.   :0.00000               
##  1st Qu.:0.0000           1st Qu.:0.00000               
##  Median :0.0000           Median :0.00000               
##  Mean   :0.4744           Mean   :0.05459               
##  3rd Qu.:1.0000           3rd Qu.:0.00000               
##  Max.   :1.0000           Max.   :1.00000               
##  distribution_channel_Direct distribution_channel_GDS
##  Min.   :0.0000              Min.   :0.000000        
##  1st Qu.:0.0000              1st Qu.:0.000000        
##  Median :0.0000              Median :0.000000        
##  Mean   :0.1218              Mean   :0.001623        
##  3rd Qu.:0.0000              3rd Qu.:0.000000        
##  Max.   :1.0000              Max.   :1.000000        
##  distribution_channel_TA/TO distribution_channel_Undefined is_repeated_guest_no
##  Min.   :0.000              Min.   :0.0e+00                Min.   :0.000       
##  1st Qu.:1.000              1st Qu.:0.0e+00                1st Qu.:1.000       
##  Median :1.000              Median :0.0e+00                Median :1.000       
##  Mean   :0.822              Mean   :8.4e-06                Mean   :0.968       
##  3rd Qu.:1.000              3rd Qu.:0.0e+00                3rd Qu.:1.000       
##  Max.   :1.000              Max.   :1.0e+00                Max.   :1.000       
##  is_repeated_guest_yes reserved_room_type_A reserved_room_type_B
##  Min.   :0.00000       Min.   :0.00         Min.   :0.000000    
##  1st Qu.:0.00000       1st Qu.:0.00         1st Qu.:0.000000    
##  Median :0.00000       Median :1.00         Median :0.000000    
##  Mean   :0.03201       Mean   :0.72         Mean   :0.009369    
##  3rd Qu.:0.00000       3rd Qu.:1.00         3rd Qu.:0.000000    
##  Max.   :1.00000       Max.   :1.00         Max.   :1.000000    
##  reserved_room_type_C reserved_room_type_D reserved_room_type_E
##  Min.   :0.00000      Min.   :0.0000       Min.   :0.00000     
##  1st Qu.:0.00000      1st Qu.:0.0000       1st Qu.:0.00000     
##  Median :0.00000      Median :0.0000       Median :0.00000     
##  Mean   :0.00783      Mean   :0.1613       Mean   :0.05464     
##  3rd Qu.:0.00000      3rd Qu.:0.0000       3rd Qu.:0.00000     
##  Max.   :1.00000      Max.   :1.0000       Max.   :1.00000     
##  reserved_room_type_F reserved_room_type_G reserved_room_type_H
##  Min.   :0.00000      Min.   :0.00000      Min.   :0.000000    
##  1st Qu.:0.00000      1st Qu.:0.00000      1st Qu.:0.000000    
##  Median :0.00000      Median :0.00000      Median :0.000000    
##  Mean   :0.02431      Mean   :0.01752      Mean   :0.005055    
##  3rd Qu.:0.00000      3rd Qu.:0.00000      3rd Qu.:0.000000    
##  Max.   :1.00000      Max.   :1.00000      Max.   :1.000000    
##  reserved_room_type_L assigned_room_type_A assigned_room_type_B
##  Min.   :0.00e+00     Min.   :0.0000       Min.   :0.00000     
##  1st Qu.:0.00e+00     1st Qu.:0.0000       1st Qu.:0.00000     
##  Median :0.00e+00     Median :1.0000       Median :0.00000     
##  Mean   :5.05e-05     Mean   :0.6212       Mean   :0.01816     
##  3rd Qu.:0.00e+00     3rd Qu.:1.0000       3rd Qu.:0.00000     
##  Max.   :1.00e+00     Max.   :1.0000       Max.   :1.00000     
##  assigned_room_type_C assigned_room_type_D assigned_room_type_E
##  Min.   :0.0000       Min.   :0.0000       Min.   :0.00000     
##  1st Qu.:0.0000       1st Qu.:0.0000       1st Qu.:0.00000     
##  Median :0.0000       Median :0.0000       Median :0.00000     
##  Mean   :0.0198       Mean   :0.2117       Mean   :0.06508     
##  3rd Qu.:0.0000       3rd Qu.:0.0000       3rd Qu.:0.00000     
##  Max.   :1.0000       Max.   :1.0000       Max.   :1.00000     
##  assigned_room_type_F assigned_room_type_G assigned_room_type_H
##  Min.   :0.00000      Min.   :0.00000      Min.   :0.000000    
##  1st Qu.:0.00000      1st Qu.:0.00000      1st Qu.:0.000000    
##  Median :0.00000      Median :0.00000      Median :0.000000    
##  Mean   :0.03139      Mean   :0.02135      Mean   :0.005955    
##  3rd Qu.:0.00000      3rd Qu.:0.00000      3rd Qu.:0.000000    
##  Max.   :1.00000      Max.   :1.00000      Max.   :1.000000    
##  assigned_room_type_I assigned_room_type_K assigned_room_type_L
##  Min.   :0.000000     Min.   :0.000000     Min.   :0.0e+00     
##  1st Qu.:0.000000     1st Qu.:0.000000     1st Qu.:0.0e+00     
##  Median :0.000000     Median :0.000000     Median :0.0e+00     
##  Mean   :0.003003     Mean   :0.002347     Mean   :8.4e-06     
##  3rd Qu.:0.000000     3rd Qu.:0.000000     3rd Qu.:0.0e+00     
##  Max.   :1.000000     Max.   :1.000000     Max.   :1.0e+00     
##  deposit_type_No Deposit deposit_type_Non Refund deposit_type_Refundable
##  Min.   :0.0000          Min.   :0.0000          Min.   :0.000000       
##  1st Qu.:1.0000          1st Qu.:0.0000          1st Qu.:0.000000       
##  Median :1.0000          Median :0.0000          Median :0.000000       
##  Mean   :0.8761          Mean   :0.1226          Mean   :0.001362       
##  3rd Qu.:1.0000          3rd Qu.:0.0000          3rd Qu.:0.000000       
##  Max.   :1.0000          Max.   :1.0000          Max.   :1.000000       
##  customer_type_Contract customer_type_Group customer_type_Transient
##  Min.   :0.00000        Min.   :0.000000    Min.   :0.00           
##  1st Qu.:0.00000        1st Qu.:0.000000    1st Qu.:0.75           
##  Median :0.00000        Median :0.000000    Median :1.00           
##  Mean   :0.03428        Mean   :0.004794    Mean   :0.75           
##  3rd Qu.:0.00000        3rd Qu.:0.000000    3rd Qu.:1.00           
##  Max.   :1.00000        Max.   :1.000000    Max.   :1.00           
##  customer_type_Transient-Party
##  Min.   :0.0000               
##  1st Qu.:0.0000               
##  Median :0.0000               
##  Mean   :0.2109               
##  3rd Qu.:0.0000               
##  Max.   :1.0000

We can see that all the variables are now ready to process. Regarding the “curse of dimensionality”, it is true that we have increased the number of columns considerably, but since we have far more observations than features, this is not a problem here. caret could take care of the dummy creation for us, but as we want to understand the relationships in more depth, we have opted to create our own.

hotel.df = hotel.df[, -which(names(hotel.df) == "is_repeated_guest_no")]

We remove some extra variables that are redundant.

So, let’s take a look at the correlation of the variables.

df = cbind(is_canceled = as.numeric(hotel.df$is_canceled), hotel.df[, -1])
R = cor(df)

ggcorr(df, cor_matrix = R, label = TRUE)

If we zoom in a bit, we can tell that reserved_room_type is really similar to the assigned one. However, we will keep it, as our model could use this difference to predict better. The same happens with the market and distribution variables.

Also, the created dummy variables are highly correlated but that makes sense, therefore we won’t do anything to them.

Moreover, we see that arrival_date_week_number encodes the same information as the month, so we will remove it.

hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_week_number")]

rm(list = c("R", "colOrder"))

Finally, let's run a PCA to see which variables are really important, so we better understand what the models will do when they predict.

pca = prcomp(df)
fviz_contrib(pca, choice = "var", axes = 1)

As usual, the pca without scaling is not really that useful, so let’s scale the data.

pca_scaled = prcomp(scale(df))
fviz_contrib(pca_scaled, choice = "var", axes = 1)

Here we can really see that the variables are somewhat correlated. We can see that the date does not really matter, nor do the babies or the meal. However, the room_type and the deposit_type are really important.

Now that we have somewhat an idea about our data set, let’s try to visualize it and see if we find any perceptions that we can further use in our analysis.

rm(list = setdiff(ls(), c("hotel.df", "hotel.numeric.df", "dfToNumeric")))

Visualization

First things first, let’s plot the correlation of all the variables with respect to the is_canceled one and see if there are any really important ones.

corr_cancelled = sort(cor(hotel.df[,-1])[1,], decreasing = T)

corr = data.frame(corr_cancelled)

ggplot(corr,aes(x = row.names(corr), y = corr_cancelled)) + geom_bar(stat = "identity") + 
  scale_x_discrete(limits= row.names(corr)) +  labs(x = "", y = "is_canceled", title = "Correlations") + 
  theme(plot.title = element_text(hjust = 0, size = rel(1.5)), axis.text.x = element_text(angle = 45, hjust = 1))

Here, we can clearly see that the hotel is the most important variable, which could mean that there is one hotel where many customers cancel their reservation. Moreover, the room type is also important, as we saw with the PCA. And the date is not that important.

Now, let’s plot each variable boxplot regarding the is_canceled one.

featurePlot(x = hotel.df[, -1], 
            y = hotel.df[, 1], 
            plot = "box",
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")))

We first see an outlier in the adr, so we will remove it. Also, the hotel boxplots are really similar, so we will take a further look at that. The dates are not that important, but the year is. So let's remove those variables to reduce some noise.

hotel.df = hotel.df[-which(hotel.df$adr > 1000),]
hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_month")]
hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_day_of_month")]

Clean the environment.

rm(list = setdiff(ls(), c("hotel.df")))

Classification

Setup

The first part is to divide the data set into three parts: one for the final testing (only used for the final model chosen), another for testing all the models, and the last one for training. As the data set is huge, we will reduce the training set size for the sake of time and efficiency.

indexes.test.final = createDataPartition(hotel.df$is_canceled, p = 0.3, list = FALSE)

hotel.test.final = hotel.df[indexes.test.final, ]
hotel.remaining = hotel.df[-indexes.test.final, ] # split what is left into test and train

indexes.test = createDataPartition(hotel.remaining$is_canceled, p = 0.5, list = FALSE)

hotel.test = hotel.remaining[indexes.test, ]
hotel.train = hotel.remaining[-indexes.test, ]

Now, let’s set up caret. It will be our main package that we will use. So let’s define some functions that will speed up our work substantially.

Let’s separate the data set into the independent variables and the dependent one.

X.train = hotel.train[, -1]
y.train = hotel.train[, 1]

X.test = hotel.test[, -1]
y.test = hotel.test[, 1]

Because we will be fitting really complex models later on, we will load precomputed binaries to speed up the process when creating the HTML file (note that all models have been run and tested beforehand). For consistency, let's load the same training and test data that were used to generate those binaries.

rm(list = setdiff(ls(), c("hotel.df")))

load("bin/data_partition.RData")

Caret Automation

In this approach, we will use 5-fold cross-validation repeated 3 times. It is not that thorough, but given the size of the data set, anything more would take far too long to train. We will also use parallel computing to speed up the process.

Moreover, we will use the kappa statistic to measure the quality of the model, as it is better than raw accuracy: it measures the agreement between predictions and the true labels while correcting for the agreement expected by chance.
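To make explicit what kappa measures, here is a small sketch that recomputes it by hand from a 2x2 confusion matrix (the counts are taken from the decision tree results reported later):

```r
# Cohen's kappa by hand: observed agreement corrected for chance agreement.
cm = matrix(c(19620, 35, 14781, 7178), nrow = 2,
            dimnames = list(Prediction = c("no", "yes"),
                            Reference  = c("no", "yes")))
n  = sum(cm)
po = sum(diag(cm)) / n                     # observed agreement (the accuracy)
pe = sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa = (po - pe) / (1 - pe)
round(kappa, 4)  # 0.3128, matching caret's reported Kappa
```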

Finally, we will compute the confusion matrix inside this function as well, which will make processing the different models much easier. Note that the function has been tuned for this data set so that it works with the models used in this project.

runCaret = function(method, x, y, grid = NULL, x_test = NULL, y_test = NULL, family_model = NULL, preProcess = NULL) {
  
  env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
  rm(list=ls(name=env), pos=env)
  
  if (method == "svmLinearWeights2") {
    trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3,
                                savePredictions=TRUE)
  } else {
    trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3,
                                savePredictions=TRUE, classProbs=TRUE)
  }
  
  cores = detectCores() - 1
  cl = makeCluster(cores)
  registerDoParallel(cl)
  on.exit(stopCluster(cl), add = TRUE) # release the cluster even if train() fails
  
  if (is.null(family_model)) {
    model = train(x = x,
                y = y,
                method = method,
                trControl = trainControl,
                tuneGrid = grid,
                metric = "Kappa",
                preProc = preProcess)
  } else {
    model = train(x = x,
                y = y,
                method = method,
                trControl = trainControl,
                tuneGrid = grid,
                metric = "Kappa", 
                family = family_model,
                preProc = preProcess)
  }
  
  pred = predict(model, newdata = x_test)
  
  if (method == "svmLinearWeights2") {
    prob = pred
  } else {
    prob = predict(model, newdata = x_test, type = "prob")
  }
  
  CM = confusionMatrix(pred, y_test)
  
  print(paste(method, " used"))
  print(CM)
  
  return(list(model, prob, CM))
}

Interpretation Classification

First, we will be using models that are not flexible (low variance, high bias) in order to figure out how the variable is_canceled is predicted and how it relates to the others. That is, we will first use statistical tools to get a feel for what is happening, and then machine learning tools to try to predict as well as possible without worrying about what is going on inside.

Simple Decision Trees

In this approach, we will use the most basic tree so we can see what variables are important to predict.

dt_grid = expand.grid(cp = seq(0.2, 2, 0.2))

model.dt = runCaret("rpart", X.train, y.train, grid = dt_grid, x_test = X.test, y_test = y.test)
## [1] "rpart  used"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  19620 14781
##        yes    35  7178
##                                           
##                Accuracy : 0.644           
##                  95% CI : (0.6393, 0.6486)
##     No Information Rate : 0.5277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3128          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9982          
##             Specificity : 0.3269          
##          Pos Pred Value : 0.5703          
##          Neg Pred Value : 0.9951          
##              Prevalence : 0.4723          
##          Detection Rate : 0.4715          
##    Detection Prevalence : 0.8267          
##       Balanced Accuracy : 0.6626          
##                                           
##        'Positive' Class : no              
## 

Looking at the confusion matrix, we see that the accuracy is not that bad (0.64) for such a simple model, but the kappa is rather poor (0.31). The most concerning part, however, is the specificity: it is really low (0.33), meaning only about a third of the actual cancellations are detected. So we will take a closer look at that.
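As a sanity check, the sensitivity and specificity above follow directly from the confusion matrix; a short sketch:

```r
# With "no" as the positive class, sensitivity is the true-"no" rate and
# specificity is the share of actual cancellations ("yes") that are caught.
cm = matrix(c(19620, 35, 14781, 7178), nrow = 2,
            dimnames = list(Prediction = c("no", "yes"),
                            Reference  = c("no", "yes")))
sens = cm["no", "no"]   / sum(cm[, "no"])   # 19620 / 19655
spec = cm["yes", "yes"] / sum(cm[, "yes"])  #  7178 / 21959
round(c(sensitivity = sens, specificity = spec), 4)  # 0.9982, 0.3269
```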

But first, let’s see the tree.

rpart.plot(model.dt[[1]]$finalModel, digits = 3)

The most important variable for the simple tree is the deposit_type, which is directly related to whether the deposit is refundable. This could mean that customers who can get a refund if they cancel are more likely to do so than those who cannot. This makes a lot of sense in the real world and can be one of the most decisive factors.

With this conclusion, we could advise the hotel manager not to offer refunds in order to reduce the number of cancellations; however, this could have a really strong side effect. It would probably reduce the number of cancellations, but it would also reduce the number of reservations.

So let’s keep digging and see if we find more similarities.

Bayes Classifiers

Linear Discriminant Analysis

#model.lda = runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test)

Comparing the confusion matrices, we see that it achieves a higher kappa (0.352), accuracy (0.6655) and specificity (0.4). Let’s see why.

As LDA focuses on projecting the features in high dimension space into a lower one, it could have found a better way of separating the data than the decision tree.

#ggord(model.lda[[1]]$finalModel, y.train)

load("bin/lda.RData")

We cannot visualize the LDA in a graph: with only two classes there is a single linear discriminant, so the projection would just be a line. However, we can look at the model itself and see if we reach any conclusions.
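This one-dimensional projection is not specific to our data: with K classes, LDA yields at most K - 1 discriminants, so any two-class problem gives exactly one. A quick sketch on a built-in data set:

```r
library(MASS)

# Keep two of the three iris species, so K = 2.
two = droplevels(subset(iris, Species != "virginica"))
fit = lda(Species ~ ., data = two)

ncol(fit$scaling)  # K - 1 = 1 linear discriminant: the projection is a line
```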

model.lda[[1]]$finalModel
## Call:
## lda(x, y)
## 
## Prior probabilities of groups:
##        no       yes 
## 0.7128531 0.2871469 
## 
## Group means:
##     lead_time arrival_date_year stays_in_weekend_nights stays_in_week_nights
## no   82.23436          2016.245               0.8949719             2.366745
## yes 144.92263          2016.179               0.9137038             2.546483
##       adults  children      babies previous_cancellations
## no  1.842167 0.1075331 0.009511708             0.01265202
## yes 1.897346 0.1075661 0.003469875             0.21864720
##     previous_bookings_not_canceled booking_changes days_in_waiting_list
## no                      0.16095480      0.28536940             1.244672
## yes                     0.02739849      0.09733676             3.594340
##          adr required_car_parking_spaces total_of_special_requests
## no  104.1435                  0.08094028                 0.7479216
## yes 104.6842                  0.00000000                 0.3298184
##     hotel_City Hotel hotel_Resort Hotel   meal_BB     meal_FB   meal_HB
## no         0.7420948          0.2579052 0.7627156 0.002777274 0.1112180
## yes        0.7527827          0.2472173 0.7809923 0.010274436 0.1136497
##        meal_SC meal_Undefined market_segment_Aviation
## no  0.11512071    0.008168452             0.003358141
## yes 0.08845929    0.006624307             0.001351899
##     market_segment_Complementary market_segment_Corporate market_segment_Direct
## no                   0.008132147               0.05029951            0.13064077
## yes                  0.002568609               0.02356811            0.04452255
##     market_segment_Groups market_segment_Offline TA/TO market_segment_Online TA
## no             0.09054275                    0.2059539                0.5110728
## yes            0.27344419                    0.1896264                0.4649182
##     distribution_channel_Corporate distribution_channel_Direct
## no                      0.06070067                  0.14592485
## yes                     0.03569014                  0.05889775
##     distribution_channel_GDS distribution_channel_TA/TO
## no              0.0028135778                  0.7905427
## yes             0.0009012663                  0.9045108
##     distribution_channel_Undefined is_repeated_guest_no is_repeated_guest_yes
## no                    1.815211e-05            0.9631875            0.03681249
## yes                   0.000000e+00            0.9862557            0.01374431
##     reserved_room_type_A reserved_room_type_B reserved_room_type_C
## no             0.7103467          0.010600835          0.005863133
## yes            0.7602181          0.008336713          0.007210130
##     reserved_room_type_D reserved_room_type_E reserved_room_type_F
## no             0.1785079           0.04955527           0.02652024
## yes            0.1372629           0.04402686           0.02032355
##     reserved_room_type_G reserved_room_type_H reserved_room_type_L
## no            0.01515702          0.003412598         3.630423e-05
## yes           0.01658330          0.005948357         9.012663e-05
##     assigned_room_type_A assigned_room_type_B assigned_room_type_C
## no             0.5821928           0.02292612           0.01849700
## yes            0.7446713           0.01203190           0.01013925
##     assigned_room_type_D assigned_room_type_E assigned_room_type_F
## no             0.2452532           0.06369577           0.03459793
## yes            0.1430760           0.04501825           0.02154026
##     assigned_room_type_G assigned_room_type_H assigned_room_type_I
## no            0.02080232          0.004211291         3.249229e-03
## yes           0.01698887          0.006038484         4.506331e-05
##     assigned_room_type_K assigned_room_type_L deposit_type_No Deposit
## no          0.0045743329         0.000000e+00               0.9977673
## yes         0.0004055698         4.506331e-05               0.6698662
##     deposit_type_Non Refund deposit_type_Refundable customer_type_Contract
## no              0.001052823             0.001179887             0.02597568
## yes             0.329007255             0.001126583             0.02775900
##     customer_type_Group customer_type_Transient customer_type_Transient-Party
## no          0.005536395               0.7276275                     0.2408604
## yes         0.001442026               0.8236222                     0.1471768
## 
## Coefficients of linear discriminants:
##                                          LD1
## lead_time                       0.0023088060
## arrival_date_year              -0.0993533473
## stays_in_weekend_nights         0.0348519190
## stays_in_week_nights            0.0342858639
## adults                          0.0985704560
## children                        0.1884694879
## babies                          0.1685903436
## previous_cancellations          0.1002348139
## previous_bookings_not_canceled -0.0120161391
## booking_changes                -0.1806627172
## days_in_waiting_list            0.0006315751
## adr                             0.0024306450
## required_car_parking_spaces    -1.0087587345
## total_of_special_requests      -0.4765361745
## hotel_City Hotel               -0.2017285049
## hotel_Resort Hotel              0.2017285049
## meal_BB                         0.0318668699
## meal_FB                         0.0624746072
## meal_HB                        -0.0469001590
## meal_SC                         0.0155284879
## meal_Undefined                 -0.3663322946
## market_segment_Aviation        -0.1946696110
## market_segment_Complementary    0.5075907328
## market_segment_Corporate       -0.0592958179
## market_segment_Direct          -0.0755899968
## market_segment_Groups          -0.0860662318
## market_segment_Offline TA/TO   -0.4551610593
## market_segment_Online TA        0.3598742692
## distribution_channel_Corporate  0.0029360137
## distribution_channel_Direct    -0.1431182750
## distribution_channel_GDS       -0.5218723213
## distribution_channel_TA/TO      0.1121797876
## distribution_channel_Undefined -0.0206218483
## is_repeated_guest_no           -0.0982915220
## is_repeated_guest_yes           0.0982915220
## reserved_room_type_A           -0.3053136001
## reserved_room_type_B            0.0361748955
## reserved_room_type_C            0.4526522490
## reserved_room_type_D            0.1214506646
## reserved_room_type_E            0.4975207883
## reserved_room_type_F            0.2750360913
## reserved_room_type_G            0.5986783870
## reserved_room_type_H            0.5398921771
## reserved_room_type_L            0.3492533628
## assigned_room_type_A            0.4000799557
## assigned_room_type_B            0.0483964991
## assigned_room_type_C           -0.3532595424
## assigned_room_type_D           -0.1653148913
## assigned_room_type_E           -0.4177191541
## assigned_room_type_F           -0.5591427321
## assigned_room_type_G           -0.7742871202
## assigned_room_type_H           -0.4482696255
## assigned_room_type_I           -0.7554780351
## assigned_room_type_K           -0.1596178765
## assigned_room_type_L            3.1990086797
## deposit_type_No Deposit        -1.5333419340
## deposit_type_Non Refund         1.5802524611
## deposit_type_Refundable        -1.3362532413
## customer_type_Contract         -0.1273727652
## customer_type_Group            -0.2095492887
## customer_type_Transient         0.1215072971
## customer_type_Transient-Party  -0.1084312817

If we look at the coefficients for LD1, we see that deposit_type is the variable with the highest absolute value, and therefore the most important (as the decision tree told us). Moreover, there are other, somewhat less important variables such as required_car_parking_spaces, total_of_special_requests or the market_segment.

  • Assigned_room_type_L (positive): it looks like the clients who were assigned room L tended to cancel. So we can conclude that room type L is pretty unpopular.

  • Deposit_type_no_deposit and deposit_type_refundable (negative): this could reflect that those customers that are not able to get back their money or have not put any money are less likely to cancel.

  • Deposit_type_non_refund (positive): if a customer is not eligible for a refund, they are more likely to cancel the reservation.

  • Required_car_parking_spaces (negative): this could mean that if a customer requires parking, they are less likely to cancel the reservation.

  • Total_of_special_requests (negative): this could mean that if a customer has a lot of special requests (is picky), then the customer is likely to really want the hotel and have less probability to cancel it.

  • Market_segment_complementary (positive): this could mean that all the complementary customers are likely to cancel.

plot(varImp(model.lda[[1]], scale = F), scales = list(y = list(cex = .95)))

When we look at the variable importance of the LDA model, we can see that the conclusions are different. This is because the coefficients alone do not really tell us how important a variable is to the model: a large coefficient does not imply high importance. For instance, if a predictor has little variance, its coefficient must be large for it to have any effect on the predicted variable, so coefficient size partly reflects the scale of the variable rather than its importance.
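One way to see why raw coefficient size is not importance: rescaling a predictor changes its coefficient by the inverse factor while leaving the fitted model untouched. A toy sketch with made-up numbers:

```r
# Hypothetical data: y grows by roughly 2 per unit of x.
x = 1:10
y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)

fit1 = lm(y ~ x)
fit2 = lm(y ~ I(x / 100))  # same predictor, expressed in hundredths

coef(fit2)[2] / coef(fit1)[2]          # 100: the coefficient absorbed the rescaling
all.equal(fitted(fit1), fitted(fit2))  # TRUE: the model itself is unchanged
```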

Here we can see that the most important variables are lead_time, deposit_type (as before) and total_of_special_requests.

plot.roc(y.test, model.lda[[2]][,2], col="darkblue", print.auc = TRUE,  auc.polygon=TRUE, grid=c(0.1, 0.2),
         grid.col=c("green", "red"), max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", print.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases

After computing the ROC curve, we can see that the AUC is 0.83; we will use it to compare against other models and see which one is better.
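For reference, the AUC is the probability that a randomly chosen cancellation gets a higher score than a randomly chosen non-cancellation; a toy sketch of the pROC call with hypothetical scores:

```r
library(pROC)

truth  = factor(c("no", "no", "yes", "yes"))
scores = c(0.10, 0.40, 0.35, 0.80)  # hypothetical predicted probabilities of "yes"

# 3 of the 4 (no, yes) pairs are ranked correctly, so the AUC is 0.75.
as.numeric(auc(roc(truth, scores, levels = c("no", "yes"), direction = "<")))
```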

# runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test, preProcess = c("center", "scale"))

Output

[1] "lda  used"
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  19280 13543
       yes   375  8416
                                         
               Accuracy : 0.6655         
                 95% CI : (0.661, 0.6701)
    No Information Rate : 0.5277         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.3518         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9809         
            Specificity : 0.3833         
         Pos Pred Value : 0.5874         
         Neg Pred Value : 0.9573         
             Prevalence : 0.4723         
         Detection Rate : 0.4633         
   Detection Prevalence : 0.7887         
      Balanced Accuracy : 0.6821         
                                         
       'Positive' Class : no   

If we scale and center the data, there is no noticeable improvement, and even a decrease in the kappa. This may be because we have so many dummy variables that the columns are already on comparable scales.

Quadratic Discriminant Analysis

#model.qda = runCaret("qda", X.train, y.train, x_test = X.test, y_test = y.test)

Output

Something is wrong; all the Kappa metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1   

We cannot derive any meaningful information from the QDA, therefore we will skip it.

Naïve-Bayes Classification

prop.table(table(y.train))
## y.train
##        no       yes 
## 0.7128531 0.2871469

Here we see that a naïve classification that simply predicts that no customer will cancel is not a good idea, because our accuracy would only be 0.71. It is true that, with the tools we have used so far, it is the best prediction, but let’s see if we can improve it.
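This also illustrates why we optimize kappa rather than accuracy: the trivial "nobody cancels" classifier reaches 0.71 accuracy yet has a kappa of exactly 0. A quick sketch:

```r
p_no  = 0.7128531   # class proportions from the table above
p_yes = 0.2871469

po = p_no                   # accuracy of always predicting "no"
pe = 1 * p_no + 0 * p_yes   # chance agreement: every prediction is "no"
(po - pe) / (1 - pe)        # kappa = 0: no skill beyond the class imbalance
```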

Classic Naïve Bayes

#naive_bayes_grid = expand.grid(laplace = c(0,0.5,1.0), usekernel = TRUE, adjust=c(0,0.5,1.0))

#model.nb = runCaret("naive_bayes", X.train, y.train, grid = naive_bayes_grid ,x_test = X.test, y_test = y.test)

load("bin/nb.RData")

model.nb[[3]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  19616 14818
##        yes    39  7141
##                                           
##                Accuracy : 0.643           
##                  95% CI : (0.6384, 0.6476)
##     No Information Rate : 0.5277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3109          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9980          
##             Specificity : 0.3252          
##          Pos Pred Value : 0.5697          
##          Neg Pred Value : 0.9946          
##              Prevalence : 0.4723          
##          Detection Rate : 0.4714          
##    Detection Prevalence : 0.8275          
##       Balanced Accuracy : 0.6616          
##                                           
##        'Positive' Class : no              
## 

Using the Naïve Bayes approach, we see that the accuracy and the kappa decrease. However, let’s see if the variables are similar to the LDA.

plot(varImp(model.nb[[1]], scale = F), scales = list(y = list(cex = .95)))

Here, we see that the most important variables are lead_time, deposit_type, and total_of_special_requests. The last two appear again with almost the same importance as before, so this model also supports our hypothesis.

We will also add to the hypothesis that the longer in advance a customer books the hotel (relative to their arrival date), the more likely they are to cancel the reservation on average, as the group means for lead_time showed.

Variational Bayesian Multinomial Probit Regression

Here we use a model similar to the multinomial Bayesian model provided by R.

# naive_bayes_multinomial_grid = expand.grid(estimateTheta = seq(0, 2, 0.2))

#model.nb.multinomail = runCaret("vbmpRadial", X.train, y.train, grid = naive_bayes_multinomial_grid ,x_test = X.test, y_test = y.test)

This does not work, as our data set is too big for this model.

Other Interpretation Models (GLM)

Logistic Regression

# model.lr = runCaret("glm", X.train, y.train, x_test = X.test, y_test = y.test, family_model = "binomial")

load("bin/logr.RData")
model.lr[[3]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  19334 14246
##        yes   321  7713
##                                           
##                Accuracy : 0.6499          
##                  95% CI : (0.6453, 0.6545)
##     No Information Rate : 0.5277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3229          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9837          
##             Specificity : 0.3512          
##          Pos Pred Value : 0.5758          
##          Neg Pred Value : 0.9600          
##              Prevalence : 0.4723          
##          Detection Rate : 0.4646          
##    Detection Prevalence : 0.8069          
##       Balanced Accuracy : 0.6675          
##                                           
##        'Positive' Class : no              
## 

After running the logistic regression, we can tell that it is performing worse than the LDA, but let’s see if we can draw some conclusions.

model.lr[[1]]$finalModel$coefficients[sort(abs(model.lr[[1]]$finalModel$coefficients), decreasing = T, index.return = T)[[2]]]
##                    (Intercept)           assigned_room_type_C 
##                   1.691838e+17                  -1.155327e+15 
## `market_segment_Offline TA/TO`          market_segment_Groups 
##                   3.374083e+14                   7.997127e+14 
##    required_car_parking_spaces distribution_channel_Corporate 
##                  -3.526467e+15                  -3.599349e+15 
##           assigned_room_type_A           reserved_room_type_B 
##                  -6.521884e+14                   5.392365e+14 
##           assigned_room_type_D     `market_segment_Online TA` 
##                  -1.691994e+15                             NA 
##           reserved_room_type_C           assigned_room_type_B 
##                   9.098544e+14                  -8.723676e+14 
##           assigned_room_type_F           reserved_room_type_L 
##                  -2.628737e+15                             NA 
##           reserved_room_type_H           reserved_room_type_A 
##                   2.835641e+15                   3.201468e+14 
##          is_repeated_guest_yes           reserved_room_type_G 
##                             NA                   3.249183e+15 
##           is_repeated_guest_no           reserved_room_type_F 
##                   3.517116e+14                   1.933752e+15 
## distribution_channel_Undefined           reserved_room_type_E 
##                             NA                   1.701938e+15 
##       market_segment_Corporate        market_segment_Aviation 
##                   6.948822e+14                   5.473248e+14 
##           reserved_room_type_D           assigned_room_type_E 
##                   1.204380e+15                  -2.070179e+15 
##                 meal_Undefined                        meal_SC 
##                             NA                  -2.058664e+14 
##   `distribution_channel_TA/TO`           assigned_room_type_H 
##                  -3.524544e+15                  -2.684889e+15 
##   market_segment_Complementary    distribution_channel_Direct 
##                   5.662339e+14                  -3.776221e+15 
##         previous_cancellations          market_segment_Direct 
##                   3.381342e+14                   4.937290e+14 
##       distribution_channel_GDS                        meal_HB 
##                  -3.077365e+15                   1.242088e+13 
##                booking_changes           assigned_room_type_I 
##                  -2.052987e+14                  -4.418622e+15 
##                         babies             `hotel_City Hotel` 
##                  -1.890071e+14                  -1.311519e+14 
## previous_bookings_not_canceled                        meal_BB 
##                  -1.204735e+14                   9.936717e+13 
##      total_of_special_requests           `hotel_Resort Hotel` 
##                   1.011638e+14                             NA 
##           assigned_room_type_G              arrival_date_year 
##                  -3.377247e+15                  -8.299049e+13 
##        stays_in_weekend_nights                       children 
##                  -2.259539e+13                  -1.465835e+13 
##                         adults                        meal_FB 
##                  -1.447164e+13                   1.180741e+14 
##           stays_in_week_nights           days_in_waiting_list 
##                  -1.151472e+13                   1.079807e+12 
##                            adr                      lead_time 
##                   9.227722e+11                   4.889015e+11

Here we can see that the most important variables were: assigned_room_type_C, market_segment_Offline TA/TO, market_segment_Groups, required_car_parking_spaces, and distribution_channel_Corporate. We clearly see some similarities with LDA, but not many. This supports our hypothesis for market_segment_Groups, distribution_channel_Corporate, and required_car_parking_spaces, but not for the rest. Note also that the coefficients are enormous (around 1e15) and several are NA, which suggests the fit did not converge to stable estimates (for example due to quasi-separation or collinearity among the dummies), so these magnitudes should be read with caution. This could also explain why the model focuses on less relevant variables and gives an inferior kappa and accuracy.
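The ~1e15 coefficients above are a classic symptom of a logistic fit that cannot converge to finite estimates. A hedged toy sketch (hypothetical data) of how perfect separation inflates coefficients:

```r
# Hypothetical, perfectly separated data: every x <= 4 is a 0, every x >= 11 a 1.
x = c(1, 2, 3, 4, 11, 12, 13, 14)
y = c(0, 0, 0, 0, 1, 1, 1, 1)

# glm warns ("fitted probabilities numerically 0 or 1 occurred") and the
# coefficients blow up: their size reflects separation, not importance.
fit = suppressWarnings(glm(y ~ x, family = binomial))
max(abs(coef(fit)))  # very large
```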

plot(varImp(model.lr[[1]], scale = F), scales = list(y = list(cex = .95)))

Here, we see that if a customer requires a parking space, they are less likely to cancel the reservation on average (negative beta). Also, customers that have cancelled before are more likely to cancel on average.

Conclusions

After using all the statistical tools, we conclude with the following hypothesis:

  • Customers that do not have to pay upfront, or whose deposit is refundable, are more likely to cancel the reservation on average.
  • Customers that require a parking space are less likely to cancel.
  • Customers that book long before their arrival date are more likely to cancel on average.
  • Customers that have canceled before are more likely to cancel on average.

If we think about it, this really makes sense, and we could advise the hotels on how to reduce the number of clients that cancel. It is true that we could differentiate between the two hotels and see whether the customers who cancel at one hotel are different from those at the other, but that is beyond the goal of this project.

Prediction Classification

After taking a look at the best variables for our model, let’s see if we can improve the kappa and the accuracy by using ML tools. For this we will use more flexible models with higher variance, and see if the increase in variance is paid off by a decrease in bias so that we can predict better. Using these tools can be slow, so for the longest-running models the binaries (or the saved output) are provided; otherwise it would take ages to process.

Penalized Logistic Regression

As the Logistic Regression has performed worse on average, let’s see if Penalized Logistic Regression does better.

# model.plr = runCaret("plr", X.train, y.train, x_test = X.test, y_test = y.test)

load("bin/plr.RData")

Output

model.plr[[3]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  18792 11110
##        yes   863 10849
##                                           
##                Accuracy : 0.7123          
##                  95% CI : (0.7079, 0.7166)
##     No Information Rate : 0.5277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4382          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9561          
##             Specificity : 0.4941          
##          Pos Pred Value : 0.6285          
##          Neg Pred Value : 0.9263          
##              Prevalence : 0.4723          
##          Detection Rate : 0.4516          
##    Detection Prevalence : 0.7186          
##       Balanced Accuracy : 0.7251          
##                                           
##        'Positive' Class : no              
## 

According to the confusion matrix, we can see a huge increase in accuracy and kappa. If we dig deeper, we see that PLR has substantially increased the specificity (what we really want, because of the minimum cost) with a trade-off in sensitivity (which is understandable).
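As a sanity check, these headline metrics can be recomputed directly from the confusion-matrix counts above (a base-R sketch; ‘no’ is the positive class):

```r
# Recompute accuracy, sensitivity, specificity, and kappa from the
# PLR confusion matrix reported above ('no' is the positive class).
cm <- matrix(c(18792, 863,      # reference 'no'  (pred no, pred yes)
               11110, 10849),   # reference 'yes' (pred no, pred yes)
             nrow = 2,
             dimnames = list(pred = c("no", "yes"), ref = c("no", "yes")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # observed agreement
sens     <- cm["no", "no"]   / sum(cm[, "no"])    # recall of the 'no' class
spec     <- cm["yes", "yes"] / sum(cm[, "yes"])   # recall of the 'yes' class

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sens = sens, spec = spec, kappa = kappa), 4)
#> accuracy     sens     spec    kappa
#>   0.7123   0.9561   0.4941   0.4382
```

The values match caret’s output to four decimals, which confirms the reading of rows as predictions and columns as the reference.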

model.plr[[1]]$finalModel$coefficients[sort(abs(model.plr[[1]]$finalModel$coefficients), decreasing = T, index.return = T)[[2]][1:10]]
## required_car_parking_spaces     deposit_type_Non Refund 
##                   4.8094786                  -3.5177018 
##      previous_cancellations     deposit_type_No Deposit 
##                  -2.8539411                   2.1398114 
##        assigned_room_type_A     deposit_type_Refundable 
##                  -1.6046853                   1.3672987 
##        reserved_room_type_A        assigned_room_type_B 
##                   1.2582918                  -0.9973125 
##        assigned_room_type_I        assigned_room_type_G 
##                   0.9780396                   0.8653565

Here the predictors with the highest coefficients are deposit_type again, required_car_parking_spaces again, and previous_cancellations. This really suggests that the first two must be important for predicting whether a client is going to cancel the reservation or not.

plot(varImp(model.plr[[1]], scale = F), scales = list(y = list(cex = .95)))

Here the variable importance is almost the same as for LDA, so our previous hypothesis is reinforced: in order to know whether a customer is going to cancel the reservation or not, we could look at those variables and advise the hotel to improve them, giving a better service and reducing the cancellation ratio.

KNN

# knn_grid = expand.grid(k = seq(2, 50, 4))

# model.knn = runCaret("knn", X.train, y.train, grid = knn_grid ,x_test = X.test, y_test = y.test)

load("bin/knn.RData")
model.knn[[3]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  16667  8071
##        yes  2988 13888
##                                         
##                Accuracy : 0.7342        
##                  95% CI : (0.73, 0.7385)
##     No Information Rate : 0.5277        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.474         
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.8480        
##             Specificity : 0.6325        
##          Pos Pred Value : 0.6737        
##          Neg Pred Value : 0.8229        
##              Prevalence : 0.4723        
##          Detection Rate : 0.4005        
##    Detection Prevalence : 0.5945        
##       Balanced Accuracy : 0.7402        
##                                         
##        'Positive' Class : no            
## 

Here we see that the kappa and the accuracy are better than Penalized Logistic Regression, and the specificity is much better. This is the variance-bias trade-off at work: we have used a more flexible model, accepting higher variance in exchange for lower bias. Let’s see which are the most important variables for KNN.

plot(varImp(model.knn[[1]], scale = F), scales = list(y = list(cex = .95)))

As our hypothesis said, lead_time and deposit_type were really important, followed by special_requests. Therefore, we are fairly confident that the hotel would decrease its cancellation rate if it follows our recommendations.
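As a minimal illustration of the method itself (toy 1-D data, not the hotel set), a k-nearest-neighbours classifier labels a point by majority vote among its k closest training points:

```r
library(class)  # 'class' is a recommended package that ships with R

# Two well-separated 1-D clusters labelled 'no' and 'yes'.
train <- matrix(c(0, 1, 10, 11), ncol = 1)
cl    <- factor(c("no", "no", "yes", "yes"))
test  <- matrix(c(0.5, 10.5), ncol = 1)

# With k = 1, each test point takes the label of its single nearest neighbour.
knn(train, test, cl, k = 1)
#> [1] no  yes
#> Levels: no yes
```

The flexibility that helps KNN here is exactly what the caret grid tunes via k: small k gives low bias but high variance, large k the opposite.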

SVM

# svm_grid = expand.grid(cost = c(1, 2, 3), Loss = c(0.2, 0.5, 0.8),  weight = c(0.5, 1, 1.5))

# model.svm = runCaret("svmLinearWeights2", X.train, y.train, grid = svm_grid ,x_test = X.test, y_test = y.test)

load("bin/svm.RData")
model.svm[[3]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  19631 16216
##        yes    24  5743
##                                          
##                Accuracy : 0.6097         
##                  95% CI : (0.605, 0.6144)
##     No Information Rate : 0.5277         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2495         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9988         
##             Specificity : 0.2615         
##          Pos Pred Value : 0.5476         
##          Neg Pred Value : 0.9958         
##              Prevalence : 0.4723         
##          Detection Rate : 0.4717         
##    Detection Prevalence : 0.8614         
##       Balanced Accuracy : 0.6302         
##                                          
##        'Positive' Class : no             
## 

After computing the SVM, we see that the accuracy and the kappa are terrible, so we will discard this model.

Random Forest

# rf_grid = expand.grid(mtry = seq(1, 30, 3))

# len = as.integer(nrow(hotel.train) / 2)

# model.rf = runCaret("rf", X.train[1:len, ], y.train[1:len], grid = rf_grid ,x_test = X.test[1:len, ], y_test = y.test[1:len])

After trying multiple ways to perform Random Forest, we have concluded that it cannot be run on this data set with our hardware. We tried shrinking the rows by more than half, reducing the columns, and even growing fewer trees. However, on every attempt R simply stopped and RStudio went blank.

Gradient-Boost

# xgb_grid = expand.grid(nrounds = c(500,1000), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1), gamma = c(1, 2, 3), colsample_bytree = c(1, 2), min_child_weight = c(1), subsample = c(0.2,0.5,0.8))

# model.xgb = runCaret("xgbTree", X.train, y.train, grid = xgb_grid ,x_test = X.test, y_test = y.test)

The same happened with gradient boosting: our computer simply could not handle such a heavy model on this data. It did not matter whether we reduced the observations, the predictor variables, or the hyperparameters; even after two days of processing, nothing happened. Therefore, we opted to leave it out.

Cost Improvement

Now, let’s see which is the best threshold for the probabilities when assigning an observation as canceled or not. For this we will imagine some costs and then fit the model to minimize this cost. As we have too many observations, we will just use some wide steps and the ROC curve. We also won’t apply this to the complex models; otherwise the computer would spend weeks computing the results.

cost.unit <- c(0, 50, 250, 70)

Let’s use it for LDA, as it is a simple model and it gave us pretty good kappa and accuracy.

model.lda = runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in lda.default(x, grouping, ...): variables are collinear
## [1] "lda  used"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  19280 13543
##        yes   375  8416
##                                          
##                Accuracy : 0.6655         
##                  95% CI : (0.661, 0.6701)
##     No Information Rate : 0.5277         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3518         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9809         
##             Specificity : 0.3833         
##          Pos Pred Value : 0.5874         
##          Neg Pred Value : 0.9573         
##              Prevalence : 0.4723         
##          Detection Rate : 0.4633         
##    Detection Prevalence : 0.7887         
##       Balanced Accuracy : 0.6821         
##                                          
##        'Positive' Class : no             
## 
plot.roc(y.test, model.lda[[2]][,2], col="darkblue", print.auc = TRUE,  auc.polygon=TRUE, grid=c(0.1, 0.2),
         grid.col=c("green", "red"), max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", print.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases

EconomicCost <- function(data, lev = NULL, model = NULL) {
  y.pred = data$pred 
  y.true = data$obs
  CM = confusionMatrix(y.pred, y.true)$table
  out = sum(as.vector(CM)*cost.unit)/sum(CM)
  names(out) <- c("EconomicCost")
  out
}

This is the function that computes the economic cost of a model. So let’s see what the cost is for the LDA.

pred.lda = as.factor(ifelse(model.lda[[2]]$yes >= 0.5, "yes", "no"))

EconomicCost(data = data.frame(pred  = pred.lda, obs = y.test))
## EconomicCost 
##     95.96818
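The reported figure can be reproduced by hand from the LDA confusion-matrix counts above, applying the unit costs column-wise exactly as EconomicCost does:

```r
# Reproduce the economic cost of the thresholded LDA predictions.
# as.vector() on a 2x2 table runs column-wise: (pred no/obs no,
# pred yes/obs no, pred no/obs yes, pred yes/obs yes), so the unit
# costs 0, 50, 250, 70 line up with those four cells.
cost.unit <- c(0, 50, 250, 70)
cm <- matrix(c(19280, 375,     # observed 'no'
               13543, 8416),   # observed 'yes'
             nrow = 2)
sum(as.vector(cm) * cost.unit) / sum(cm)
#> [1] 95.96818
```

Note that the costly cell (250 per unit) is predicting ‘no’ for a customer who actually cancels, which is why pushing up specificity matters more than raw accuracy here.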
env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
rm(list=ls(name=env), pos=env)

ctrl = trainControl(method="repeatedcv", number = 5, repeats = 3,
                                savePredictions=TRUE, summaryFunction = EconomicCost)

model.lda.cost = train(x = X.train,y = y.train, method = "lda", trControl = ctrl, metric = "EconomicCost", maximize = F)
## Warning: model fit failed for Fold1.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 33 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear

## Warning in lda.default(x, grouping, ...): variables are collinear

## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold5.Rep1: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold2.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 33 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold4.Rep2: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold1.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear

## Warning in lda.default(x, grouping, ...): variables are collinear

## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold5.Rep3: parameter=none Error in lda.default(x, grouping, ...) : 
##   variable 33 appears to be constant within groups
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in lda.default(x, grouping, ...): variables are collinear
model.lda.cost
## Linear Discriminant Analysis 
## 
## 77281 samples
##    62 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61825, 61825, 61824, 61825, 61825, 61825, ... 
## Resampling results:
## 
##   EconomicCost
##   52.41088

Here we see that the cost is 52, compared to the 96 obtained before. This is a huge improvement! Let’s see the most important variables.

plot(varImp(model.lda.cost, scale = F), scales = list(y = list(cex = .95)))

Here we see similar results to the plain LDA, so we will assume that caret has tweaked the betas while keeping the same relationships.

rm(list = setdiff(ls(), c("hotel.df")))

load("bin/data_partition.RData")

Conclusion (Ensemble)

Now, to make our final prediction, we will use the best three models, which in our case were: Penalized Logistic Regression, KNN, and LDA. We won’t use the cost-improved LDA, since the other two are not cost-improved, so we keep them comparable for this example. We could improve the other two as well, but the computer would spend weeks processing.

We are using the best three models to predict our final data set.

For this, let’s import the three models

load("bin/lda.RData")
load("bin/knn.RData")
load("bin/plr.RData")

We are going to use the mode (majority vote) of the models’ predictions.

mode = function(v) {
  uniqv = unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
finalPrediction = function(data) {
  pred.lda = predict(model.lda[[1]], newdata = data)
  pred.plr = predict(model.plr[[1]], newdata = data)
  pred.knn = predict(model.knn[[1]], newdata = data)

  finalPred = apply(data.frame(pred.lda, pred.plr, pred.knn), 1, mode) 
  return(finalPred)
}

Here we construct the function that obtains each model’s prediction and takes their mode. Now that this is done, let’s use the final test data set and see how well we can predict the data. First, let’s separate the data.
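A quick toy check of the majority-vote logic (the predictions here are hypothetical, not from the fitted models):

```r
# Majority vote over three hypothetical model predictions.
mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

votes <- data.frame(lda = c("no",  "yes", "yes", "no"),
                    plr = c("no",  "no",  "yes", "yes"),
                    knn = c("yes", "yes", "yes", "no"))

unname(apply(votes, 1, mode))
#> [1] "no"  "yes" "yes" "no"
```

With three models and a binary outcome a tie is impossible, which is one reason an odd number of models is convenient for this kind of ensemble.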

X.test.final = hotel.test.final[, -1]
y.test.final = hotel.test.final[, 1]
# pred.final = as.factor(finalPrediction(X.test.final))

load("bin/classification.pred.final.RData")

confusionMatrix(pred.final, y.test.final)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  22130  7114
##        yes   294  6131
##                                           
##                Accuracy : 0.7923          
##                  95% CI : (0.7881, 0.7965)
##     No Information Rate : 0.6287          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5028          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9869          
##             Specificity : 0.4629          
##          Pos Pred Value : 0.7567          
##          Neg Pred Value : 0.9542          
##              Prevalence : 0.6287          
##          Detection Rate : 0.6204          
##    Detection Prevalence : 0.8199          
##       Balanced Accuracy : 0.7249          
##                                           
##        'Positive' Class : no              
## 

Looking at the confusion matrix, the ensemble gives an accuracy of almost 0.8 and a kappa of 0.5. That is amazing! It is a clear improvement over the best single model that we had.

We see that we can predict quite accurately the customers that won’t cancel, but not so much the customers that will. For this we could apply the cost approach used previously to the final model; however, that is out of the scope of this project.

In conclusion, we can predict whether a customer will cancel with approximately 80% accuracy. This is pretty high, and it would be crucial information for the hotels.

Regression

For this approach we will use the variable “adr” (average daily rate) and try to predict it for a specific customer. First, let’s tweak our data set a little.

rm(list = setdiff(ls(), c("hotel.df")))

adrCol = which(names(hotel.df) == "adr")
colOrder = 1:ncol(hotel.df)
colOrder[adrCol] = 1
colOrder[1] = adrCol

hotel.df = hotel.df[, colOrder]

We made the adr variable the first one for the sake of simplicity.
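The index swap can be verified on a toy data frame (column names other than adr are illustrative):

```r
# Move a target column to position 1 by swapping indices, as above.
df <- data.frame(a = 1:3, adr = 4:6, b = 7:9)

target <- which(names(df) == "adr")
ord    <- seq_along(df)
ord[c(1, target)] <- ord[c(target, 1)]  # swap position 1 with the target

df <- df[, ord]
names(df)
#> [1] "adr" "a"   "b"
```

Only positions 1 and the target are exchanged; every other column keeps its place.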

hotel.df = hotel.df[, -which(names(hotel.df) == "is_canceled")]

We remove this variable, as we are not trying to predict the cancellation.

And now, let’s divide the data set.

indexes.test.final = createDataPartition(hotel.df$adr, p = 0.3, list = FALSE)

hotel.test.final = hotel.df[indexes.test.final, ]

hotel.rest = hotel.df[-indexes.test.final, ]

indexes.test = createDataPartition(hotel.rest$adr, p = 0.5, list = FALSE)

hotel.test = hotel.rest[indexes.test, ]
hotel.train = hotel.rest[-indexes.test, ]
X.train = hotel.train[, -1]
y.train = hotel.train[, 1]

X.test = hotel.test[, -1]
y.test = hotel.test[, 1]

Now, we have our data set ready to perform regression.

rm(list = setdiff(ls(), c("hotel.df")))

load("bin/data_partition_regression.RData")

This is a check point with the data loaded.

runCaret = function(method, form, data, test, grid = NULL, family_model = NULL, preProcess = NULL) {
  
  env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
  rm(list=ls(name=env), pos=env)
  
  trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3, allowParallel = T)
  
  cores = detectCores() - 1
  cl = makeCluster(cores)
  registerDoParallel(cl)
  
  tryCatch(
    if (is.null(family_model)) {
      model = train(form, 
                    data,
                  method = method,
                  trControl = trainControl,
                  tuneGrid = grid,
                  preProc = preProcess)
    } else {
      model = train(form, 
                    data,
                  method = method,
                  trControl = trainControl,
                  tuneGrid = grid,
                  family = family_model,
                  preProc = preProcess)
    })
  
  stopCluster(cl)
  
  pred = predict(model, test[, -1])
  
  cor.pred = cor(test[, 1], pred)^2
  
  print(paste(method, " used"))
  print(model)
  print(paste("R^2 of the testing is ", cor.pred))
  
  return(list(model, pred, cor.pred))
}

We have tweaked the function slightly to better accommodate regression.
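In particular, the test R² that the function reports is just the squared correlation between observed and predicted values; a minimal sketch with made-up numbers:

```r
# Test-set R^2 as the squared correlation, as in runCaret's cor.pred.
obs  <- c(80, 100, 120, 90, 150)   # made-up observed adr values
pred <- c(85, 95, 130, 88, 140)    # made-up model predictions

r2 <- cor(obs, pred)^2
r2 > 0.9           # a close (hypothetical) fit explains most of the variance
#> [1] TRUE
cor(obs, obs)^2    # a perfect prediction would give exactly 1
#> [1] 1
```

This squared-correlation version of R² ignores any systematic bias in the predictions, so it can flatter a model whose predictions are shifted or rescaled.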

Visualization of adr Variable

Let’s start by visualizing the data according to the adr variable.

ggcorr(hotel.df, label = TRUE)

If we just look at the correlations of adr, we do not see any meaningful correlation with the other variables. However, we see that the assigned_room variables have a really high correlation among themselves. This is expected, as they come from the same factor variable and are just dummies. Therefore, we will have to take care with collinearity in our models.
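The collinearity among dummies coming from one factor is structural: the dummy columns of any factor sum to one in every row, so each column is a linear combination of the rest. A small illustration:

```r
# Dummy columns generated from one factor always sum to 1 per row,
# so together they are perfectly collinear with the intercept.
room    <- factor(c("A", "B", "C", "A", "B"))
dummies <- model.matrix(~ room - 1)   # one 0/1 column per level

unname(rowSums(dummies))
#> [1] 1 1 1 1 1
```

This is why later models either drop one level per factor or report NA coefficients for the redundant dummies.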

corr_price <- sort(cor(hotel.train)["adr",], decreasing = T)
## Warning in cor(hotel.train): the standard deviation is zero
corr=data.frame(corr_price)

ggplot(corr,aes(x = row.names(corr), y = corr_price)) + 
  geom_bar(stat = "identity", fill = "lightblue") + 
  scale_x_discrete(limits= row.names(corr)) +
  labs(x = "Predictors", y = "Price", title = "Correlations") + 
  theme(plot.title = element_text(hjust = 0, size = rel(1.5)),
        axis.text.x = element_text(angle = 45, hjust = 1))

Here we see a really interesting graph. The number of children and adults is strongly positively correlated with adr, and so is the room type. Moreover, some room types, such as A, are negatively correlated (presumably it is the cheapest type of room).

rm(list = c("corr_price", "corr"))

Now let’s see the distribution of the adr.

hotel.train %>% ggplot(aes(x=adr)) + geom_density(fill = "black")

The distribution is not smooth; it looks roughly like a normal distribution centered around 100 with a stretched right tail, plus something odd at the very beginning. So what about the log?

hotel.train %>% ggplot(aes(x=adr)) + geom_density(fill="black") + scale_x_log10() 
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1233 rows containing non-finite values (stat_density).

Here we do see some improvement, and the odd left part has been removed (the warnings above show those rows were dropped). However, as the improvement is not noticeable, let’s use adr as-is without any transformation.
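The warnings come from the adr = 0 rows: log10(0) is -Inf, so those observations cannot be drawn on a log scale. A quick check:

```r
adr <- c(0, 50, 100, 200)     # includes a zero-price row, as in the promotions
sum(!is.finite(log10(adr)))   # log10(0) is -Inf, so one row gets dropped
#> [1] 1
```

This also explains the “removed rows containing non-finite values” count: it equals the number of adr = 0 observations in the training set.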

ggplot(hotel.train, aes(x = (adults + children + babies), y=adr)) + ylab("adr") + 
  geom_point(alpha=0.6) + ggtitle("Price per person")

Here we see some really interesting patterns: the price slightly increases with the number of people at first (with some outliers), but there are some points far to the right with adr = 0. Let’s investigate a little.

weirdIndexes = which(hotel.train$adr == 0 & (hotel.train$adults + hotel.train$children + hotel.train$babies) >= 15)

View(hotel.train[weirdIndexes, ])

This looks like a promotion that the hotel ran, so we will remove it; otherwise it would introduce a lot of noise into our model.

hotel.train = hotel.train[-weirdIndexes, ]

Now let’s clean the environment and perform some simple regressions to try to understand better the data.

rm(list = ls())

load("bin/data_partition_regression_ready.RData")

Be Aware

  • Collinearity: as stated before, we will have to take real care, as the majority of our columns come from the creation of dummy variables.

  • Overfitting: because of the large number of columns that we have, our models could overfit. However, the ratio of rows to columns is really high, so this problem is somewhat mitigated. Overfitting will mainly be a concern when we try to estimate and interpret the betas, not when predicting.

Interpretation Regression

Let’s try fitting a simple multiple regression (MR) and observe the huge overfitting.

model.mr = runCaret("lm", adr ~ ., hotel.train, test = hotel.test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## [1] "lm  used"
## Linear Regression 
## 
## 118888 samples
##     60 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 95111, 95110, 95111, 95110, 95110, 95111, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   36.13239  0.4354296  25.97487
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is  0.410757105804983"

Our R² on the testing set using this model is 0.41. This is pretty bad, and our first guess is that it is due to the huge amount of overfitting. However, let’s see if we can draw some meaningful conclusions.

plot(varImp(model.mr[[1]], scale = F), scales = list(y = list(cex = .95)))

Here we see that the most important variable is the one indicating whether the hotel is the City Hotel. This could be because the City Hotel is much more expensive than the other one. This is a reasonable hypothesis; let’s see if it is true.

mean(hotel.train$adr[which(hotel.train$`hotel_City Hotel` == 1)])
## [1] 105.2624
mean(hotel.train$adr[which(hotel.train$`hotel_City Hotel` == 0)])
## [1] 95.33249

Here we see that, on average, the City Hotel is about 10 units more expensive than the other hotel. So this predictor will be very handy for predicting the price per night.

model.mr[[1]]$finalModel
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##                           (Intercept)                              lead_time  
##                            -1.856e+04                             -2.968e-02  
##                     arrival_date_year                stays_in_weekend_nights  
##                             9.223e+00                             -5.020e-01  
##                  stays_in_week_nights                                 adults  
##                             7.309e-01                              8.939e+00  
##                              children                                 babies  
##                             1.635e+01                              2.405e+00  
##                previous_cancellations         previous_bookings_not_canceled  
##                            -1.033e+00                             -3.069e-02  
##                       booking_changes                   days_in_waiting_list  
##                             5.373e-01                              7.820e-03  
##           required_car_parking_spaces              total_of_special_requests  
##                             7.073e+00                              2.630e+00  
##              `\\`hotel_City Hotel\\``             `\\`hotel_Resort Hotel\\``  
##                             2.712e+01                                     NA  
##                               meal_BB                                meal_FB  
##                            -3.003e+01                              1.372e+01  
##                               meal_HB                                meal_SC  
##                             7.488e-01                             -4.365e+01  
##                        meal_Undefined                market_segment_Aviation  
##                                    NA                             -1.347e+01  
##          market_segment_Complementary               market_segment_Corporate  
##                            -1.081e+02                             -2.496e+01  
##                 market_segment_Direct                  market_segment_Groups  
##                            -5.579e-01                             -3.099e+01  
##  `\\`market_segment_Offline TA/TO\\``       `\\`market_segment_Online TA\\``  
##                            -2.460e+01                                     NA  
##        distribution_channel_Corporate            distribution_channel_Direct  
##                             2.069e+01                              1.633e+01  
##              distribution_channel_GDS     `\\`distribution_channel_TA/TO\\``  
##                             3.800e+01                              1.756e+01  
##        distribution_channel_Undefined                  is_repeated_guest_yes  
##                                    NA                             -1.615e+01  
##                  reserved_room_type_A                   reserved_room_type_B  
##                            -7.105e+01                             -9.469e+01  
##                  reserved_room_type_C                   reserved_room_type_D  
##                            -2.744e+01                             -4.631e+01  
##                  reserved_room_type_E                   reserved_room_type_F  
##                            -4.227e+01                             -3.047e+01  
##                  reserved_room_type_G                   reserved_room_type_H  
##                            -1.903e+01                              2.528e-01  
##                  reserved_room_type_L                   assigned_room_type_A  
##                                    NA                              1.341e+02  
##                  assigned_room_type_B                   assigned_room_type_C  
##                             1.377e+02                              1.366e+02  
##                  assigned_room_type_D                   assigned_room_type_E  
##                             1.269e+02                              1.351e+02  
##                  assigned_room_type_F                   assigned_room_type_G  
##                             1.404e+02                              1.436e+02  
##                  assigned_room_type_H                   assigned_room_type_I  
##                             1.373e+02                              8.811e+01  
##                  assigned_room_type_K                   assigned_room_type_L  
##                             8.840e+01                                     NA  
##       `\\`deposit_type_No Deposit\\``        `\\`deposit_type_Non Refund\\``  
##                            -1.357e+01                             -3.137e-01  
##               deposit_type_Refundable                 customer_type_Contract  
##                                    NA                             -1.883e+00  
##                   customer_type_Group                customer_type_Transient  
##                            -1.241e+01                             -1.280e+00  
## `\\`customer_type_Transient-Party\\``  
##                                    NA
qplot(model.mr[[2]], hotel.test[, 1]) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 300), y = c(0, 300)) +
  geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 313 rows containing missing values (geom_point).

And here we see the huge overfitting that we have. It is tremendous. What if we scale the data?

model.mr.scale = runCaret("lm", adr ~ ., hotel.train, test = hotel.test, preProcess = c("scale", "center"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## [1] "lm  used"
## Linear Regression 
## 
## 118888 samples
##     60 predictor
## 
## Pre-processing: scaled (60), centered (60) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 95110, 95109, 95111, 95112, 95110, 95110, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   36.12724  0.4355737  25.97483
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is  0.410757105804979"

There is no improvement.

qplot(model.mr.scale[[2]], hotel.test[, 1]) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 300), y = c(0, 300)) +
  geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 313 rows containing missing values (geom_point).

And the overfitting is the same.

Now let’s try to fit some other models with much less overfitting.

Hypothesis Model

model.hmr = runCaret("lm", adr ~ lead_time + I(adults + children + babies) + `hotel_City Hotel` + 
                      I(stays_in_weekend_nights + stays_in_week_nights),
                    hotel.train, test = hotel.test)
## [1] "lm  used"
## Linear Regression 
## 
## 118888 samples
##      7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 95112, 95111, 95109, 95109, 95111, 95110, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   43.6508  0.1794635  32.04992
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is  0.172311291668375"

Here we see that the testing R² is a terrible 0.17, so we won’t use this model. However, let’s look at the relationships anyway.

plot(varImp(model.hmr[[1]], scale = F), scales = list(y = list(cex = .95)))

Here we see that the number of people is the most important variable. However, due to the low R², we should not trust this model at all.

Feature Engineering

hotel.df$people = hotel.df$adults + hotel.df$children + hotel.df$babies

We just joined the three variables to see if the total number of people is a good predictor for our models.

hotel.df = hotel.df[, -which(names(hotel.df) == "hotel_Resort Hotel")]

We remove the Resort Hotel dummy, as it is just the inverse of the City Hotel one.
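This is safe because with only two hotels the two dummies are exact complements, so one column carries all the information (illustrative check with hypothetical data):

```r
hotel <- factor(c("City Hotel", "Resort Hotel", "City Hotel"))
d <- model.matrix(~ hotel - 1)   # the two dummy columns

# Each dummy is 1 minus the other, so keeping one loses nothing.
all(d[, "hotelCity Hotel"] == 1 - d[, "hotelResort Hotel"])
#> [1] TRUE
```

Dropping one of the pair removes an exact collinearity, which is why lm previously reported an NA coefficient for `hotel_Resort Hotel`.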

rm(list = ls())

load("bin/data_partition_regression_ready_feature.RData")

Now we load the precomputed data to speed up the process.

Forward Regression

# fwdr_grid = expand.grid(nvmax = 4:(ncol(hotel.train)-10))

# model.fwdr = runCaret("leapForward", adr ~ ., hotel.train, test = hotel.test, grid = fwdr_grid)

load("bin/fwdr.RData")

The best model was achieved with 51 variables, with an R² on the testing set of 0.41, the same as with all the variables. Therefore, this method is better than the first one: with fewer variables we achieve the same R². What this R² means is that 41% of the variance in the test set is explained by the model; the remaining 59% is unexplained noise. This is not very good for predictions!

So let’s see the variables that are important for this model.

plot(varImp(model.fwdr[[1]], scale = F), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

Here the most important variables are reserved_room_type_A, the number of children and whether the market segment is online. Now let’s look at the coefficients to see whether each relationship is positive or negative and draw some conclusions.

model.fwdr[[1]]$bestTune
##    nvmax
## 48    51

We see that the best tune uses nvmax = 51.

sort(coef(model.fwdr[[1]]$finalModel, 51), decreasing = T)
## NULL

Here we see that reserved_room_type_A has a coefficient of about -40: on average, a customer pays 40 units less when booking a room of type A compared with the baseline.

Also, the number of children has a positive coefficient of about 16, versus about 8 for adults: on average, each child adds roughly 8 units more to the price than each adult.

These are interesting conclusions, and they match what we would expect. But have we decreased the overfitting?

qplot(model.fwdr[[2]], hotel.test[, 1]) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 300), y = c(0, 300)) +
  geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 319 rows containing missing values (geom_point).

Slightly, but not by much. The observations at zero are probably promotions, as noted before; to fix this we would need to ask the owner of the data set to add a promotion indicator variable so those cases could be predicted accurately. Since this data set was not designed to predict this variable, that is why we get a fairly low R².

Now let’s try backward regression, although we expect to get the same results as forward regression.

Backward Regression

bwdr_grid = expand.grid(nvmax = 4:(ncol(hotel.train)-10))

model.bwdr = runCaret("leapBackward", adr ~ ., hotel.train, test = hotel.test, grid = bwdr_grid)
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 9
## linear dependencies found
## Reordering variables and trying again:
## Warning in rval$lopt[] <- rval$vorder[rval$lopt]: number of items to replace is
## not a multiple of replacement length
## [1] "leapBackward  used"
## Linear Regression with Backwards Selection 
## 
## 77281 samples
##    60 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61824, 61825, 61826, 61825, 61824, 61825, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    4     42.63806  0.1635902  31.62946
##    5     42.07633  0.1854437  31.15672
##    6     41.95973  0.1899737  31.10753
##    7     41.80578  0.1959081  31.00519
##    8     41.12272  0.2219793  30.34479
##    9     40.21661  0.2567459  29.37830
##   10     39.46677  0.2841285  28.80734
##   11     39.19366  0.2939538  28.59910
##   12     39.11342  0.2968612  28.53906
##   13     39.03658  0.2996077  28.46156
##   14     38.31308  0.3250146  27.57253
##   15     38.24822  0.3273214  27.50214
##   16     38.13892  0.3310252  27.41621
##   17     37.58948  0.3500763  26.97205
##   18     37.56298  0.3510054  26.93853
##   19     37.44373  0.3551247  26.87642
##   20     37.16043  0.3648716  26.66441
##   21     37.08833  0.3673183  26.63213
##   22     36.17901  0.3978193  25.96245
##   23     36.10486  0.4003049  25.92161
##   24     36.08165  0.4010751  25.89831
##   25     36.04677  0.4022400  25.86691
##   26     36.03143  0.4027520  25.85509
##   27     36.00319  0.4036906  25.82611
##   28     35.97749  0.4045526  25.81342
##   29     35.94445  0.4056359  25.80400
##   30     35.93692  0.4058929  25.79347
##   31     35.92205  0.4063858  25.78218
##   32     35.90724  0.4068751  25.77032
##   33     35.83519  0.4092052  25.71811
##   34     35.70060  0.4136036  25.63097
##   35     35.38170  0.4240599  25.44348
##   36     35.32244  0.4260179  25.40520
##   37     35.20815  0.4297349  25.33339
##   38     35.17713  0.4307274  25.30880
##   39     35.08306  0.4337858  25.22767
##   40     35.05179  0.4348018  25.21287
##   41     35.02067  0.4358040  25.18366
##   42     34.99394  0.4366723  25.16937
##   43     34.95816  0.4378258  25.14809
##   44     34.94799  0.4381537  25.13851
##   45     34.94187  0.4383477  25.13046
##   46     34.93148  0.4386820  25.12765
##   47     34.88139  0.4403082  25.06850
##   48     34.81646  0.4423915  24.99902
##   49     34.80408  0.4427919  24.98899
##   50     34.79205  0.4431723  24.98484
##   51     34.78041  0.4435399  24.97422
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 51.
## [1] "R^2 of the testing is  0.399817616910282"

We get the same results as with forward regression, so there is no new information to extract here. However, let’s check whether the most important variables are the same.

plot(varImp(model.bwdr[[1]], scale = F), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

And they are exactly the same, as expected. For this reason we will not run stepwise regression, as it would give identical results. The main takeaway is that this data set is not well prepared for predicting the adr variable, which is why the models need as many variables as possible.

Custom model

Let’s create a model with fewer variables, using roughly the top-ranked ones, and try to reach a similar R² while reducing the overfitting.

custom.model = adr ~ reserved_room_type_A + children + `market_segment_Online TA` + adults + 
  assigned_room_type_A + market_segment_Groups + assigned_room_type_G + customer_type_Transient + assigned_room_type_F + 
  total_of_special_requests + reserved_room_type_D

model.custom.lr = runCaret("lm", custom.model,
                    hotel.train, test = hotel.test)
## [1] "lm  used"
## Linear Regression 
## 
## 77281 samples
##    11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61826, 61826, 61824, 61824, 61824, 61826, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   39.39202  0.2866397  29.1128
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is  0.26355102272907"

Its predictive power is poor, so let’s try some modifications.

custom.model = adr ~ reserved_room_type_A + children + `market_segment_Online TA` + adults + 
  assigned_room_type_A + market_segment_Groups + assigned_room_type_G + customer_type_Transient + assigned_room_type_F + 
  total_of_special_requests + reserved_room_type_D + market_segment_Complementary + distribution_channel_Corporate

model.custom.lr = runCaret("lm", custom.model,
                    hotel.train, test = hotel.test)
## [1] "lm  used"
## Linear Regression 
## 
## 77281 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61826, 61824, 61824, 61825, 61825, 61824, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   38.54242  0.3166468  28.64607
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is  0.286403596991798"

We are getting roughly the same results as with forward regression: adding more variables increases the R². So for this data set we end up using all the variables to predict adr. From a statistical point of view this is a poor approach, since we are adding a lot of noise to the model (overfitting), but in practice it predicts best here. The main reason is that this data set was built to classify whether customers would cancel their reservation, not to predict the average price per night, which depends on many factors the data set does not capture.

Ridge regression

What about ridge regression? Does its penalty on large coefficients help us avoid ending up with 51 variables in the model? Let’s see.

#ridge_grid = expand.grid(lambda = seq(0, .1, length = 100))

#model.ridge = runCaret("ridge", adr ~ ., hotel.train, test = hotel.test, grid = ridge_grid)
## Something is wrong; all the RMSE metric values are missing:
##       RMSE        Rsquared        MAE     
##  Min.   : NA   Min.   : NA   Min.   : NA  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
##  Median : NA   Median : NA   Median : NA  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
##  Max.   : NA   Max.   : NA   Max.   : NA  
##  NA's   :100   NA's   :100   NA's   :100  
## Error: Stopping
## In addition: Warning message:
## In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :

It is a shame, but after trying ridge regression in multiple ways, it did not work. The main problem, I think, is the huge number of dummy variables in this data set: if a cross-validation fold does not contain every level, the fit fails with this error. I tried removing variables and shrinking the data, both column-wise and row-wise, but nothing worked.

What about not creating the dummy variables ourselves and letting caret take care of it?

#rm(list = ls())

#load("bin/data_partition_regression_ready_feature_no_dummy.RData")
#ridge_grid = expand.grid(lambda = seq(0, .1, length = 10))

#model.ridge = runCaret("ridge", adr ~ ., hotel.train, test = hotel.test, grid = ridge_grid)

And nothing: it keeps breaking, so let’s skip it. It is a shame, because ridge regression looked promising as a way to shrink our best model, since it penalizes large coefficients. However, due to the enormous data set we are not able to run it.
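One workaround worth noting (a sketch, not run in this report) is to bypass caret’s "ridge" method and call glmnet directly with alpha = 0, building the design matrix ourselves:

```r
library(glmnet)

# model.matrix builds the full dummy-encoded design matrix;
# glmnet with alpha = 0 is pure ridge regression.
x = model.matrix(adr ~ ., data = hotel.train)[, -1]  # drop the intercept column
y = hotel.train$adr

# cv.glmnet chooses lambda by cross-validation over its own lambda path.
ridge.fit = cv.glmnet(x, y, alpha = 0)
# coef(ridge.fit, s = "lambda.min")  # coefficients at the CV-chosen lambda
```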

Lasso

#lasso_grid = expand.grid(fraction = seq(0, .1, length = 100))

#model.lasso = runCaret("lasso", adr ~ ., hotel.train[1:10000, ], test = hotel.test, grid = lasso_grid)

Same as ridge regression: the lasso is very similar, using the L1 norm of beta instead of the squared L2 norm for the penalty, and it does not work either. So let’s see whether elastic net works for us.
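The same direct-glmnet route sketched for ridge applies to the lasso; only the mixing parameter changes (again an assumption, not something run here):

```r
library(glmnet)

# alpha = 1 selects the pure lasso penalty (L1 norm), which can zero out coefficients.
x = model.matrix(adr ~ ., data = hotel.train)[, -1]
lasso.fit = cv.glmnet(x, hotel.train$adr, alpha = 1)
# coef(lasso.fit, s = "lambda.min")  # inspect which coefficients survive
```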

Elastic Net

#elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))

#model.elastic = runCaret("glmnet", adr ~ ., hotel.train, test = hotel.test, grid = elastic_grid)

load("bin/elastic.RData")

After computing the model we see that the R² is just 0.4, the same as our first model.

plot(model.elastic[[1]])

model.elastic[[1]]$bestTune
##     alpha lambda
## 176  0.15    0.1

And this strengthens our hypothesis. The best mixing parameter is small (alpha = 0.15, i.e. close to ridge), meaning the model keeps as many variables as possible to improve its R². Also, interestingly, every value of the regularization parameter lambda gives the same result. Again, this suggests the data set is not well suited to predicting the adr variable, but let’s see whether we can draw some conclusions from this model.

plot(varImp(model.elastic[[1]], scale = F), scales = list(y = list(cex = .95)))

For this model the most important variables are the room types a customer reserved or was assigned. This is interesting, since we saw that type A rooms were the cheapest; let’s look at these coefficients.

coef(model.elastic[[1]]$finalModel)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 's0', 's1', 's2' ... ]]
##                                                                        
## (Intercept)                    104.3174 103.7061 103.1868088 102.710528
## lead_time                        .        .        .           .       
## arrival_date_year                .        .        .           .       
## stays_in_weekend_nights          .        .        .           .       
## stays_in_week_nights             .        .        .           .       
## adults                           .        .        .           .       
## children                         .        .        0.1590313   1.073439
## babies                           .        .        .           .       
## previous_cancellations           .        .        .           .       
## previous_bookings_not_canceled   .        .        .           .       
##                                                                          
## (Intercept)                    101.997133 101.189070 100.401587 99.637521
## lead_time                        .          .          .         .       
## arrival_date_year                .          .          .         .       
## stays_in_weekend_nights          .          .          .         .       
## stays_in_week_nights             .          .          .         .       
## adults                           .          .          .         .       
## children                         1.957537   2.815247   3.653877  4.469494
## babies                           .          .          .         .       
## previous_cancellations           .          .          .         .       
## previous_bookings_not_canceled   .          .          .         .       
##                                                         
## (Intercept)                    -349.2796720 -1351.551356
## lead_time                         .             .       
## arrival_date_year                 0.2222897     0.719040
## stays_in_weekend_nights           .             .       
## stays_in_week_nights              .             .       
## adults                            .             .       
## children                          5.2541012     5.913572
## babies                            .             .       
## previous_cancellations            .             .       
## previous_bookings_not_canceled    .             .

This does not give us any meaningful information because of the huge number of predictors we have, so let’s use our simpler model instead.

qplot(model.elastic[[2]], hotel.test[, 1]) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 300), y = c(0, 300)) +
  geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 316 rows containing missing values (geom_point).

Woah! Here we see some improvement in the overfitting. It is true that the model still tends to predict lower values than the observed ones, i.e. it underprices on average. However, there is less overfitting.

elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))

model.elastic.simple = runCaret("glmnet", custom.model, hotel.train, test = hotel.test, grid = elastic_grid)
## [1] "glmnet  used"
## glmnet 
## 
## 77281 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61825, 61825, 61824, 61826, 61824, 61824, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE      Rsquared   MAE     
##   0.00   0.00    38.58005  0.3155642  28.64729
##   0.00   0.01    38.58005  0.3155642  28.64729
##   0.00   0.02    38.58005  0.3155642  28.64729
##   0.00   0.03    38.58005  0.3155642  28.64729
##   0.00   0.04    38.58005  0.3155642  28.64729
##   0.00   0.05    38.58005  0.3155642  28.64729
##   0.00   0.06    38.58005  0.3155642  28.64729
##   0.00   0.07    38.58005  0.3155642  28.64729
##   0.00   0.08    38.58005  0.3155642  28.64729
##   0.00   0.09    38.58005  0.3155642  28.64729
##   0.00   0.10    38.58005  0.3155642  28.64729
##   0.01   0.00    38.56527  0.3159801  28.64774
##   0.01   0.01    38.56527  0.3159801  28.64774
##   0.01   0.02    38.56527  0.3159801  28.64774
##   0.01   0.03    38.56527  0.3159801  28.64774
##   0.01   0.04    38.56527  0.3159801  28.64774
##   0.01   0.05    38.56527  0.3159801  28.64774
##   0.01   0.06    38.56527  0.3159801  28.64774
##   0.01   0.07    38.56527  0.3159801  28.64774
##   0.01   0.08    38.56527  0.3159801  28.64774
##   0.01   0.09    38.56527  0.3159801  28.64774
##   0.01   0.10    38.56527  0.3159801  28.64774
##   0.02   0.00    38.56530  0.3159783  28.64772
##   0.02   0.01    38.56530  0.3159783  28.64772
##   0.02   0.02    38.56530  0.3159783  28.64772
##   0.02   0.03    38.56530  0.3159783  28.64772
##   0.02   0.04    38.56530  0.3159783  28.64772
##   0.02   0.05    38.56530  0.3159783  28.64772
##   0.02   0.06    38.56530  0.3159783  28.64772
##   0.02   0.07    38.56530  0.3159783  28.64772
##   0.02   0.08    38.56530  0.3159783  28.64772
##   0.02   0.09    38.56530  0.3159783  28.64772
##   0.02   0.10    38.56530  0.3159783  28.64772
##   0.03   0.00    38.56532  0.3159777  28.64769
##   0.03   0.01    38.56532  0.3159777  28.64769
##   0.03   0.02    38.56532  0.3159777  28.64769
##   0.03   0.03    38.56532  0.3159777  28.64769
##   0.03   0.04    38.56532  0.3159777  28.64769
##   0.03   0.05    38.56532  0.3159777  28.64769
##   0.03   0.06    38.56532  0.3159777  28.64769
##   0.03   0.07    38.56532  0.3159777  28.64769
##   0.03   0.08    38.56532  0.3159777  28.64769
##   0.03   0.09    38.56532  0.3159777  28.64769
##   0.03   0.10    38.56532  0.3159777  28.64769
##   0.04   0.00    38.56535  0.3159786  28.64763
##   0.04   0.01    38.56535  0.3159786  28.64763
##   0.04   0.02    38.56535  0.3159786  28.64763
##   0.04   0.03    38.56535  0.3159786  28.64763
##   0.04   0.04    38.56535  0.3159786  28.64763
##   0.04   0.05    38.56535  0.3159786  28.64763
##   0.04   0.06    38.56535  0.3159786  28.64763
##   0.04   0.07    38.56535  0.3159786  28.64763
##   0.04   0.08    38.56535  0.3159786  28.64763
##   0.04   0.09    38.56535  0.3159786  28.64763
##   0.04   0.10    38.56535  0.3159786  28.64763
##   0.05   0.00    38.56536  0.3159782  28.64762
##   0.05   0.01    38.56536  0.3159782  28.64762
##   0.05   0.02    38.56536  0.3159782  28.64762
##   0.05   0.03    38.56536  0.3159782  28.64762
##   0.05   0.04    38.56536  0.3159782  28.64762
##   0.05   0.05    38.56536  0.3159782  28.64762
##   0.05   0.06    38.56536  0.3159782  28.64762
##   0.05   0.07    38.56536  0.3159782  28.64762
##   0.05   0.08    38.56536  0.3159782  28.64762
##   0.05   0.09    38.56536  0.3159782  28.64762
##   0.05   0.10    38.56536  0.3159782  28.64762
##   0.06   0.00    38.56540  0.3159766  28.64755
##   0.06   0.01    38.56540  0.3159766  28.64755
##   0.06   0.02    38.56540  0.3159766  28.64755
##   0.06   0.03    38.56540  0.3159766  28.64755
##   0.06   0.04    38.56540  0.3159766  28.64755
##   0.06   0.05    38.56540  0.3159766  28.64755
##   0.06   0.06    38.56540  0.3159766  28.64755
##   0.06   0.07    38.56540  0.3159766  28.64755
##   0.06   0.08    38.56540  0.3159766  28.64755
##   0.06   0.09    38.56540  0.3159766  28.64755
##   0.06   0.10    38.56540  0.3159766  28.64755
##   0.07   0.00    38.56538  0.3159773  28.64757
##   0.07   0.01    38.56538  0.3159773  28.64757
##   0.07   0.02    38.56538  0.3159773  28.64757
##   0.07   0.03    38.56538  0.3159773  28.64757
##   0.07   0.04    38.56538  0.3159773  28.64757
##   0.07   0.05    38.56538  0.3159773  28.64757
##   0.07   0.06    38.56538  0.3159773  28.64757
##   0.07   0.07    38.56538  0.3159773  28.64757
##   0.07   0.08    38.56538  0.3159773  28.64757
##   0.07   0.09    38.56538  0.3159773  28.64757
##   0.07   0.10    38.56538  0.3159773  28.64757
##   0.08   0.00    38.56538  0.3159749  28.64753
##   0.08   0.01    38.56538  0.3159749  28.64753
##   0.08   0.02    38.56538  0.3159749  28.64753
##   0.08   0.03    38.56538  0.3159749  28.64753
##   0.08   0.04    38.56538  0.3159749  28.64753
##   0.08   0.05    38.56538  0.3159749  28.64753
##   0.08   0.06    38.56538  0.3159749  28.64753
##   0.08   0.07    38.56538  0.3159749  28.64753
##   0.08   0.08    38.56538  0.3159749  28.64753
##   0.08   0.09    38.56538  0.3159749  28.64753
##   0.08   0.10    38.56538  0.3159749  28.64753
##   0.09   0.00    38.56538  0.3159745  28.64754
##   0.09   0.01    38.56538  0.3159745  28.64754
##   0.09   0.02    38.56538  0.3159745  28.64754
##   0.09   0.03    38.56538  0.3159745  28.64754
##   0.09   0.04    38.56538  0.3159745  28.64754
##   0.09   0.05    38.56538  0.3159745  28.64754
##   0.09   0.06    38.56538  0.3159745  28.64754
##   0.09   0.07    38.56538  0.3159745  28.64754
##   0.09   0.08    38.56538  0.3159745  28.64754
##   0.09   0.09    38.56538  0.3159745  28.64754
##   0.09   0.10    38.56538  0.3159745  28.64754
##   0.10   0.00    38.56540  0.3159763  28.64751
##   0.10   0.01    38.56540  0.3159763  28.64751
##   0.10   0.02    38.56540  0.3159763  28.64751
##   0.10   0.03    38.56540  0.3159763  28.64751
##   0.10   0.04    38.56540  0.3159763  28.64751
##   0.10   0.05    38.56540  0.3159763  28.64751
##   0.10   0.06    38.56540  0.3159763  28.64751
##   0.10   0.07    38.56540  0.3159763  28.64751
##   0.10   0.08    38.56540  0.3159763  28.64751
##   0.10   0.09    38.56540  0.3159763  28.64751
##   0.10   0.10    38.56540  0.3159763  28.64751
##   0.11   0.00    38.56544  0.3159751  28.64741
##   0.11   0.01    38.56544  0.3159751  28.64741
##   0.11   0.02    38.56544  0.3159751  28.64741
##   0.11   0.03    38.56544  0.3159751  28.64741
##   0.11   0.04    38.56544  0.3159751  28.64741
##   0.11   0.05    38.56544  0.3159751  28.64741
##   0.11   0.06    38.56544  0.3159751  28.64741
##   0.11   0.07    38.56544  0.3159751  28.64741
##   0.11   0.08    38.56544  0.3159751  28.64741
##   0.11   0.09    38.56544  0.3159751  28.64741
##   0.11   0.10    38.56544  0.3159751  28.64741
##   0.12   0.00    38.56538  0.3159737  28.64763
##   0.12   0.01    38.56538  0.3159737  28.64763
##   0.12   0.02    38.56538  0.3159737  28.64763
##   0.12   0.03    38.56538  0.3159737  28.64763
##   0.12   0.04    38.56538  0.3159737  28.64763
##   0.12   0.05    38.56538  0.3159737  28.64763
##   0.12   0.06    38.56538  0.3159737  28.64763
##   0.12   0.07    38.56538  0.3159737  28.64763
##   0.12   0.08    38.56538  0.3159737  28.64763
##   0.12   0.09    38.56538  0.3159737  28.64763
##   0.12   0.10    38.56538  0.3159737  28.64763
##   0.13   0.00    38.56540  0.3159757  28.64749
##   0.13   0.01    38.56540  0.3159757  28.64749
##   0.13   0.02    38.56540  0.3159757  28.64749
##   0.13   0.03    38.56540  0.3159757  28.64749
##   0.13   0.04    38.56540  0.3159757  28.64749
##   0.13   0.05    38.56540  0.3159757  28.64749
##   0.13   0.06    38.56540  0.3159757  28.64749
##   0.13   0.07    38.56540  0.3159757  28.64749
##   0.13   0.08    38.56540  0.3159757  28.64749
##   0.13   0.09    38.56540  0.3159757  28.64749
##   0.13   0.10    38.56540  0.3159757  28.64749
##   0.14   0.00    38.56541  0.3159781  28.64758
##   0.14   0.01    38.56541  0.3159781  28.64758
##   0.14   0.02    38.56541  0.3159781  28.64758
##   0.14   0.03    38.56541  0.3159781  28.64758
##   0.14   0.04    38.56541  0.3159781  28.64758
##   0.14   0.05    38.56541  0.3159781  28.64758
##   0.14   0.06    38.56541  0.3159781  28.64758
##   0.14   0.07    38.56541  0.3159781  28.64758
##   0.14   0.08    38.56541  0.3159781  28.64758
##   0.14   0.09    38.56541  0.3159781  28.64758
##   0.14   0.10    38.56541  0.3159781  28.64758
##   0.15   0.00    38.56544  0.3159739  28.64754
##   0.15   0.01    38.56544  0.3159739  28.64754
##   0.15   0.02    38.56544  0.3159739  28.64754
##   0.15   0.03    38.56544  0.3159739  28.64754
##   0.15   0.04    38.56544  0.3159739  28.64754
##   0.15   0.05    38.56544  0.3159739  28.64754
##   0.15   0.06    38.56544  0.3159739  28.64754
##   0.15   0.07    38.56544  0.3159739  28.64754
##   0.15   0.08    38.56544  0.3159739  28.64754
##   0.15   0.09    38.56544  0.3159739  28.64754
##   0.15   0.10    38.56544  0.3159739  28.64754
##   0.16   0.00    38.56545  0.3159769  28.64753
##   0.16   0.01    38.56545  0.3159769  28.64753
##   0.16   0.02    38.56545  0.3159769  28.64753
##   0.16   0.03    38.56545  0.3159769  28.64753
##   0.16   0.04    38.56545  0.3159769  28.64753
##   0.16   0.05    38.56545  0.3159769  28.64753
##   0.16   0.06    38.56545  0.3159769  28.64753
##   0.16   0.07    38.56545  0.3159769  28.64753
##   0.16   0.08    38.56545  0.3159769  28.64753
##   0.16   0.09    38.56545  0.3159769  28.64753
##   0.16   0.10    38.56545  0.3159769  28.64753
##   0.17   0.00    38.56541  0.3159730  28.64773
##   0.17   0.01    38.56541  0.3159730  28.64773
##   0.17   0.02    38.56541  0.3159730  28.64773
##   0.17   0.03    38.56541  0.3159730  28.64773
##   0.17   0.04    38.56541  0.3159730  28.64773
##   0.17   0.05    38.56541  0.3159730  28.64773
##   0.17   0.06    38.56541  0.3159730  28.64773
##   0.17   0.07    38.56541  0.3159730  28.64773
##   0.17   0.08    38.56541  0.3159730  28.64773
##   0.17   0.09    38.56541  0.3159730  28.64773
##   0.17   0.10    38.56541  0.3159730  28.64773
##   0.18   0.00    38.56543  0.3159717  28.64745
##   0.18   0.01    38.56543  0.3159717  28.64745
##   0.18   0.02    38.56543  0.3159717  28.64745
##   0.18   0.03    38.56543  0.3159717  28.64745
##   0.18   0.04    38.56543  0.3159717  28.64745
##   0.18   0.05    38.56543  0.3159717  28.64745
##   0.18   0.06    38.56543  0.3159717  28.64745
##   0.18   0.07    38.56543  0.3159717  28.64745
##   0.18   0.08    38.56543  0.3159717  28.64745
##   0.18   0.09    38.56543  0.3159717  28.64745
##   0.18   0.10    38.56543  0.3159717  28.64745
##   0.19   0.00    38.56542  0.3159769  28.64758
##   0.19   0.01    38.56542  0.3159769  28.64758
##   0.19   0.02    38.56542  0.3159769  28.64758
##   0.19   0.03    38.56542  0.3159769  28.64758
##   0.19   0.04    38.56542  0.3159769  28.64758
##   0.19   0.05    38.56542  0.3159769  28.64758
##   0.19   0.06    38.56542  0.3159769  28.64758
##   0.19   0.07    38.56542  0.3159769  28.64758
##   0.19   0.08    38.56542  0.3159769  28.64758
##   0.19   0.09    38.56542  0.3159769  28.64758
##   0.19   0.10    38.56542  0.3159769  28.64758
##   0.20   0.00    38.56541  0.3159727  28.64773
##   0.20   0.01    38.56541  0.3159727  28.64773
##   0.20   0.02    38.56541  0.3159727  28.64773
##   0.20   0.03    38.56541  0.3159727  28.64773
##   0.20   0.04    38.56541  0.3159727  28.64773
##   0.20   0.05    38.56541  0.3159727  28.64773
##   0.20   0.06    38.56541  0.3159727  28.64773
##   0.20   0.07    38.56541  0.3159727  28.64773
##   0.20   0.08    38.56541  0.3159727  28.64773
##   0.20   0.09    38.56541  0.3159727  28.64773
##   0.20   0.10    38.56541  0.3159727  28.64773
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.01 and lambda = 0.1.
## [1] "R^2 of the testing is  0.286151472151666"

Oh, this is quite bad: this model is about 25% worse than the previous one. Still, let’s see whether we can draw any conclusions.

plot(varImp(model.elastic.simple[[1]], scale = F), scales = list(y = list(cex = .95)))

Here we see that, again, elastic net focuses on the room type and the market segment. So let’s take a look at the coefficients.

coef(model.elastic.simple[[1]]$finalModel)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 's0', 's1', 's2' ... ]]
##                                                                          
## (Intercept)                104.3174 104.38957956 104.45922206 104.5195607
## reserved_room_type_A         .       -0.09977738  -0.20808248  -0.3259203
## children                     .        .            0.08323533   0.2008870
## `market_segment_Online TA`   .        .            .            0.0252787
## adults                       .        .            .            .        
## assigned_room_type_A         .        .            .            .        
## market_segment_Groups        .        .            .            .        
## assigned_room_type_G         .        .            .            .        
## customer_type_Transient      .        .            .            .        
## assigned_room_type_F         .        .            .            .        
##                                                                              
## (Intercept)                104.5538204 104.61556715 104.57760742 104.53220829
## reserved_room_type_A        -0.4533059  -0.59063064  -0.73702093  -0.89511598
## children                     0.3282879   0.46646857   0.61583106   0.77768863
## `market_segment_Online TA`   0.1147508   0.21183976   0.31645884   0.42979645
## adults                       .           .            0.07044053   0.14734262
## assigned_room_type_A         .          -0.04015922  -0.12753553  -0.22130944
## market_segment_Groups        .           .            .           -0.01195244
## assigned_room_type_G         .           .            .            0.09595685
## customer_type_Transient      .           .            .            .         
## assigned_room_type_F         .           .            .            0.06913068
##                                                    
## (Intercept)                104.4827388 104.33217943
## reserved_room_type_A        -1.0619217  -1.23758418
## children                     0.9474354   1.13085906
## `market_segment_Online TA`   0.5500779   0.67609141
## adults                       0.2299825   0.31786653
## assigned_room_type_A        -0.3185852  -0.41980711
## market_segment_Groups       -0.1272397  -0.24698813
## assigned_room_type_G         0.4000957   0.72689892
## customer_type_Transient      .           0.07322347
## assigned_room_type_F         0.3096616   0.56764520

Here we see the following:

  • market_segment_Complementary (strongly negative): on average this segment’s nightly price is about 25 units below the overall average. This could be because the segment books mostly type A rooms (the cheapest, as we saw) or has some special deal with the hotel.

  • reserved_room_type_A (negative): on average this room type is about 5 units cheaper than the average, as we knew.

  • children (positive): as we saw, children add more to the price than adults.
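These directional claims can be spot-checked against raw group means, e.g. for the room-type-A dummy (a sketch; the dummy column name is assumed from the earlier encoding):

```r
# Mean adr with (1) and without (0) reserved room type A;
# the type-A group should come out lower.
tapply(hotel.train$adr, hotel.train$reserved_room_type_A, mean)
```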

Conclusions

So the conclusions that we got to understand the price of a specific customer are:

  • The room type A is the cheapest.

  • Children tend to pay more on average.

  • The Complementary market segment books mostly type A rooms or has special deals with the hotel; it is the segment with the lowest average adr.

Now, will the graphs say the same as our conclusions?

ggplot(hotel.train) + aes(x = as.factor(assigned_room_type_A), y = adr) + geom_boxplot()

Awesome! Our first conclusion was correct: on average, type A rooms are cheaper than the others.

ggplot(hotel.train) + aes(x = as.factor(children), y = adr) + geom_boxplot()

Also correct: the more children in a reservation, the higher the price.

ggplot(hotel.train) + aes(x = as.factor(market_segment_Complementary), y = adr) + geom_boxplot()

And wow, that is a really big difference. We could study further why this segment has a lower adr on average, but that is out of scope. Now let’s try to predict the adr variable better (so far we are doing a bit poorly).

NOTE: with statistical tools, these variables appear to be the most important for predicting adr. However, as shown later, if we fit non-linear (more flexible) models, the number of people becomes the most important variable, together with room type A. Linear models do not show this clearly because the relationship is not linear and therefore not noticeable with linear statistical tools. We could fit a second-order (or higher) polynomial to check whether that is really the case. But let’s now try to predict.
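The polynomial check suggested above could be sketched like this (hypothetical, not run in this report; the engineered people column is assumed to be present in hotel.train):

```r
# Quadratic term for the number of people, to capture a possible
# non-linear effect on adr that a plain linear term would miss.
model.poly = lm(adr ~ poly(people, 2, raw = TRUE), data = hotel.train)
summary(model.poly)$r.squared
```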

Advanced Tools for Regression

KNN

For KNN we will standardize the data, because otherwise variables with large scales, such as the year, would dominate the small ones in the distance computation. It is always a good idea to standardize the data for KNN.
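Standardization here just means centering each column and dividing by its standard deviation, which caret applies inside each resampling fold via preProcess = c("center", "scale"); the manual equivalent is a sketch like:

```r
# Manual equivalent of caret's preProcess = c("center", "scale"):
# subtract each numeric column's mean and divide by its standard deviation.
numeric.cols = sapply(hotel.train, is.numeric)
hotel.train.scaled = hotel.train
hotel.train.scaled[numeric.cols] = scale(hotel.train[numeric.cols])
```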

# knn.reg_grid = data.frame(kmax=c(11,13,15,19,21),distance=2, kernel='optimal')

# model.knn.reg = runCaret("kknn", adr ~ ., hotel.train, test = hotel.test, grid = knn.reg_grid, preProcess = c("center", "scale"))

load("bin/kknn.RData")

model.knn.reg[[3]]
## [1] 0.5288361

It is impressive that KNN gives an R² of 0.53. It is a huge leap from where we were.

model.knn.reg[[1]]
## k-Nearest Neighbors 
## 
## 77281 samples
##    60 predictor
## 
## Pre-processing: centered (60), scaled (60) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61825, 61825, 61824, 61825, 61825, 61825, ... 
## Resampling results across tuning parameters:
## 
##   kmax  RMSE      Rsquared   MAE     
##   11    31.45667  0.5479302  19.61445
##   13    31.29816  0.5512924  19.66235
##   15    31.20029  0.5533680  19.72326
##   19    31.10693  0.5552679  19.86855
##   21    31.10041  0.5553931  19.88979
## 
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
##  parameter 'kernel' was held constant at a value of optimal
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 21, distance = 2 and kernel
##  = optimal.

Here we see that the more neighbors we choose, the better, which matches our earlier observation that the more columns we had, the better. This further supports our hypothesis: this data set is not well suited to predicting adr, which is why it needs so many neighbors. We can also infer that, given the size of the data set, we need many neighbours to predict accurately. However, this k was the maximum our computer could handle; beyond it, the fit kept crashing. Still, this is the best model so far.

plot(model.knn.reg[[1]])

This graph shows what we had said previously.

plot(varImp(model.knn.reg[[1]], scale = T), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

These are some really impressive results. It looks like the variable we created is the most important, so we can add to our hypothesis that the number of people in each reservation is highly correlated with the average price. Also, as stated before, room type A and the number of children are quite important.

Random Forest

#rf_grid = data.frame(mtry = seq(1, 15, 3))

#model.rf = runCaret("rf", adr ~ ., hotel.train, test = hotel.test, grid = rf_grid)

After multiple attempts to make it work, we have concluded that this data set is too complex for our computer to handle random forest, so we will leave it for now.
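A possible workaround, had we wanted a rough random forest anyway, would be to fit it on a random subsample of the training rows with a modest number of trees. This is only a sketch under those assumptions, not a model we actually ran:

```r
library(randomForest)

# Fit on a random 10k-row subsample of the ~77k training rows to keep
# memory and run time manageable; accuracy will suffer somewhat.
set.seed(42)
idx = sample(nrow(hotel.train), 10000)
model.rf.small = randomForest(adr ~ ., data = hotel.train[idx, ],
                              ntree = 100, importance = TRUE)
```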

Gradient Boosting

# gradient_grid = expand.grid(nrounds = c(500,1000), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1),
#                                          gamma = c(1, 2, 3), colsample_bytree = c(1, 2),
#                                         min_child_weight = c(1), subsample = c(0.2,0.5,0.8))

#model.xgb = runCaret("xgbTree", adr ~ ., hotel.train, test = hotel.test, grid = gradient_grid)

Same as the Random Forest.

SVM

# svm_grid = expand.grid(cost = c(1, 2, 3))

# model.svm = runCaret("svmLinear2", adr ~ ., hotel.train, grid = svm_grid , test = hotel.test)

load("bin/svm_reg.RData")
model.svm[[1]]
## Support Vector Machines with Linear Kernel 
## 
## 77281 samples
##    60 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 61825, 61825, 61825, 61825, 61824, 61825, ... 
## Resampling results across tuning parameters:
## 
##   cost  RMSE      Rsquared   MAE     
##   1     35.40016  0.4341135  24.26441
##   2     35.36536  0.4343760  24.32515
##   3     35.24342  0.4363186  24.26424
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 3.
model.svm[[3]]
## [1] 0.3919574

Here we can see that the R² is just 0.39, which is a little disappointing. However, let’s look at the most important variables.

plot(varImp(model.svm[[1]], scale = T), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

And this is impressive! The most important variables are the same as for KNN, so we can add these variables to our hypothesis.

Other Factors

It is true that most of the models we tried are not very good. The main reason is that this data set was not designed to predict the average price per night of a hotel; still, we managed to get fairly good results with KNN (as far as we could). We might have obtained better results with other models such as random forest or gradient boosting, but we had computational constraints. We tried to reduce the dimensionality of the data set and also the number of rows, but we were unsuccessful. Now let’s fit the best three models together and see if we can get better results.

Conclusion (Ensemble)

Now, to make our final prediction, we will be using the best three models that in our case were: KNN, Elastic Net, and Forward Regression.

We are using the best three models to predict our final data set.

For this, let’s import the three models

rm(list = ls())

load("bin/data_partition_regression_ready_feature.RData")

load("bin/fwdr.RData")
load("bin/kknn.RData")
load("bin/elastic.RData")# to create

We are going to use the mean of the models.

finalPrediction = function(data) {
  pred.fwdr = predict(model.fwdr[[1]], newdata = data)
  pred.knn = predict(model.knn.reg[[1]], newdata = data)
  pred.elastic = predict(model.elastic[[1]], newdata = data)
  
  finalPred = apply(data.frame(pred.fwdr, pred.knn, pred.elastic), 1, mean) 
  return(finalPred)
}

Here we construct the function that predicts with each model and averages the results. Now that that’s done, let’s use the final test data set and see how well we can predict the data.

pred.final = finalPrediction(hotel.test.final[, -1])
  
cor(hotel.test.final[, 1], pred.final)^2
## [1] 0.5539924

And as we saw with classification, the ensemble improves the prediction considerably: the R² rose from 0.53 with KNN alone to about 0.55. Now, let’s look at our errors.

Final Predictions

ggplot(as.data.frame(pred.final)) + aes(x = pred.final) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks pretty much like a normal distribution, slightly shifted to the left. There are some negative predictions that we could round up to zero; the most likely explanation is the observations where adr was 0.
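That rounding step is straightforward (a sketch; we keep the raw predictions for the interval analysis that follows):

```r
# adr cannot be negative, so we could clamp the predictions at zero.
sum(pred.final < 0)                      # count of impossible negative predictions
pred.final.clamped = pmax(pred.final, 0) # round them up to zero
```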

So let’s try to compute some prediction intervals so we can assess a customer if the price that they are given is under priced, well priced, or over priced.

Prediction Intervals

what about the errors?

error = hotel.test.final[, 1] - pred.final

ggplot(as.data.frame(error)) + aes(x = error) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is pretty good. The error of our models is centered at 0 and follows more or less a normal distribution, which is convenient for the prediction intervals. Note, though, that the right tail is heavier, so the intervals will be tighter on the lower side and wider on the upper side.

To build the prediction intervals, we will use half of the data to estimate the error quantiles; otherwise we would overfit the intervals and they would not be that accurate. We will then compute the lower and upper limits on the other half.

noise = error[1:(length(error) / 2)]

Because the right tail is quite heavy, we will compute an 80% prediction interval.

pred.interval.data = pred.final[((length(pred.final) / 2) + 1):length(pred.final)]

lwr = pred.interval.data - abs(quantile(noise,0.1, na.rm=T))
upr = pred.interval.data + quantile(noise,0.9, na.rm=T)
pred.interval.df = data.frame(real= hotel.test.final[((nrow(hotel.test.final) / 2) + 1):nrow(hotel.test.final), 1], 
                              fit= pred.interval.data, lwr=lwr, upr=upr)

pred.interval.df = pred.interval.df %>% mutate(out=factor(if_else((real<lwr | real>upr),1,0)))

mean(pred.interval.df$out==1)
## [1] 0.06117185

So about 6% of customers were over-priced or under-priced according to our model.

ggplot(pred.interval.df, aes(x=fit, y=real))+
  geom_point(aes(color=out)) + theme(legend.position="none") +
  xlim(0, 300) + ylim(0, 300)+
  geom_ribbon(data=pred.interval.df,aes(ymin=lwr,ymax=upr),alpha=0.3) +
  labs(title = "Prediction intervals", x = "prediction",y="real price")
## Warning: Removed 83 rows containing missing values (geom_point).

Here we see that our ensemble model tends to under-price the real adr, but it is still pretty good. With this we can assess whether each customer is paying more or less than they should. Finally, we can see some zero values on the real-price axis; those are most likely promotions that the hotel ran, or errors that meant those reservations were not charged. As said before, we would need to add another variable encoding this in order to predict those cases.

Final Conclusion

After predicting the qualitative variable and the quantitative one, we have clearly seen that this data set is well suited for classification and not for regression. However, all in all we achieved some pretty good results.

Furthermore, we got some pretty good insights into the data, and we could advise the hotels on how to improve their cancellation ratio and tell the customer how much a night should cost.